ABSTRACT

Matching dependencies were recently introduced as declarative
rules for data cleaning and entity resolution. Enforcing a
matching dependency on a database instance identifies the
values of some attributes for two tuples, provided that the
values of some other attributes are sufficiently similar.
Assuming the existence of matching functions for making two
attribute values equal, we formally introduce the process of
cleaning an instance using matching dependencies as a
chase-like procedure. We show that matching functions
naturally introduce a lattice structure on attribute domains,
and a partial order of semantic domination between instances.
Using the latter, we define the semantics of clean query
answering in terms of certain/possible answers as the greatest
lower bound/least upper bound of all possible answers obtained
from the clean instances. We show that clean query answering is
intractable in some cases. Then we study queries that behave
monotonically w.r.t. the semantic domination order, and show
that we can provide an under/over-approximation for clean
answers to monotone queries. Moreover, non-monotone positive
queries can be relaxed into monotone queries.

1. INTRODUCTION 

Matching dependencies (MDs) in relational databases were 
recently introduced in [16] as a means of codifying a domain 
expert's knowledge that is used in improving data quality. 
They specify that a pair of attribute values in two database 
tuples are to be matched, i.e., made equal, if similarities 
hold between other pairs of values in the same tuples. This 
is a generalization of entity resolution [T^, where basically 
full tuples have to be merged or identified since they seem 
to refer to the same entity of the outside reality. This form 
of data fusion [10] is important in data quality assessment
and in data cleaning.

Matching dependencies were formally studied in [17], as 
semantic constraints for data cleaning and were given a 
model-theoretic semantics. The main emphasis in that pa- 
per was on the problem of entailment of MDs and on the 
existence of a formal axiom system for that task. 

MDs as presented in [17] do not specify how the matching
of attribute values is to be done. In data cleaning, the
user, on the basis of his or her experience and knowledge 
of the application domain, may have a particular method- 
ology or heuristics for enforcing the identifications. In this 
paper we investigate MDs in the context of matching func- 
tions. These are functions that abstract the implementa- 
tion of value identification. Rather than investigate specific 
matching functions, we explore a class of matching func- 
tions satisfying certain natural and intuitive axioms. With 
these axioms, matching functions impose a lattice-theoretic 
structure on attribute domains. Intuitively, given two input 
attribute values that need to be made equal, the match- 
ing function produces a value that contains the information 
present in the two inputs and semantically dominates them. 
We show this semantic domination partial order can be nat- 
urally lifted to tuples of values as well as database instances 
as sets of tuples. 

Example 1. Consider the following database instance $D_0$.
Assume there is a matching dependency stating that if for
two tuples the values of name and phone are similar, then
the value of address should be made identical. Consider
a similarity relation that indicates the values of name and
phone are similar for the two tuples in this instance. To
enforce the matching dependency, we create another instance
$D_1$ in which the value of address for the two tuples is the
result of applying a matching function to the two previous
addresses. This function combines the information in those
address values.



$D_0$:
name        phone            address
John Doe    (613)123 4567    Main St., Ottawa
J. Doe      123 4567         25 Main St.

$D_1$:
name        phone            address
John Doe    (613)123 4567    25 Main St., Ottawa
J. Doe      123 4567         25 Main St., Ottawa



We can continue this process in a chase-like manner if there
are still other MD violations in $D_1$. ■



The framework of [17] leaves the implementation details of the
data cleaning process with MDs completely unspecified and
implicitly leaves them to the application at hand. We point
out some limitations of the proposal in [17] for purposes of 
cleaning dirty instances in the presence of multiple MDs, and 
show that a formulation of the formal semantics of the sat- 
isfaction and enforcement of MDs, incorporating matching 
functions, remedies this problem. In giving such a formu- 
lation, we revisit the original semantics for MDs proposed 
in [17], propose some changes and investigate their conse- 
quences. More precisely, we define intended clean instances, 
those that are obtained through the application of the MDs 
in a chase-like procedure. We further investigate properties 
of this procedure in relation to the properties of the match- 
ing functions, and show that, in general, the chase procedure 
produces several different clean instances, each of which se- 
mantically dominates the original dirty instance. 

We then address the problem of query answering over a 
dirty instance, where the MDs do not hold. We take advan- 
tage of the semantic domination order between instances, 
and define clean answers by specifying a tight lower bound 
(corresponding to certain answers) and a tight upper bound 
(corresponding to possible answers) for all answers that can 
be obtained from any of the possibly many clean instances. 
We show that computing the exact bounds is intractable in 
general. However, in polynomial time we can generate an 
under-approximation for certain answers as well as an over- 
approximation for possible answers for queries that behave 
monotonically w.r.t. the semantic domination order. 

We argue that monotone queries provide more informative 
answers on instances that have been cleaned with MDs and 
matching functions. We therefore introduce new relational 
algebra operators that make use of the underlying lattice 
structure on the domain of attribute values. These operators 
can be used to relax a regular positive relational algebra 
query and make it monotone w.r.t. the semantic domination 
order. 

Recently, Swoosh [S] has been proposed as a generic framework
for entity resolution. In entity resolution, whole tuples
are identified, or merged into a new tuple, whenever
similarities hold between the tuples on some attributes. Ac- 
cordingly, the similarity and matching functions work at the 
tuple level. Given their similarity of purpose, it is interest- 
ing to ask what is the relationship between the frameworks 
of MDs and of Swoosh. We address this question in this 
paper. 

In summary, we make the following contributions: 

• We identify the limitations of the original proposal of
MDs [17] w.r.t. the application of data cleaning in the
presence of multiple MDs and show that the limitations
can be overcome by considering MDs along with
matching functions.

• We study matching functions in terms of their prop- 
erties, which are certain intuitive and natural axioms. 
Matching functions induce a lattice framework on at- 
tribute domains which can be lifted to a partial order 
over instances, that we call semantic domination. 

• We formally characterize answering a query given a
dirty instance and a set of MDs, and capture it using
certain and possible answers. Computing these answers
is intractable in general. For queries that are
monotone w.r.t. the semantic domination relation, we
develop a polynomial time heuristic procedure for
obtaining under- and over-approximations of query answers.

• We demonstrate the power of the framework of MDs 
and of our chase procedure for MD application by re- 
constructing the most common case for Swoosh, the 
so-called union and merge case, in terms of matching 
dependencies with matching functions. 

The paper is organized as follows. In Section 2 we provide
necessary background on matching dependencies as originally
introduced. We introduce matching functions and the
notion of semantic domination in Section 3. Then we define
the data cleaning process with MDs in Section 4. We
explore the semantics of query answering in Section 5. In
Section 6 we study monotone queries and show how clean
answers can be approximated. We establish a connection
to an important related work, Swoosh, in Section 7 and
present concluding remarks in Section 8.

2. BACKGROUND 

A database schema $\mathcal{R}$ is a set $\{R_1, \ldots, R_n\}$ of relation
names. Every relation $R_i$ is associated with a set of
attributes, written as $R_i(A_1, \ldots, A_m)$, where each attribute
$A_j$ has a domain $Dom_{A_j}$. We assume that attribute names
are different across relations in the schema, but two
attributes $A_j, A_k$ can be comparable, i.e., $Dom_{A_j} = Dom_{A_k}$.
An instance $D$ of schema $\mathcal{R}$ assigns a finite set of tuples $t^D$
to every relation $R_i$, where $t^D$ can be seen as a function
that maps every attribute $A_j$ in $R_i$ to a value in $Dom_{A_j}$.
We write $t^D[A_j]$ to refer to this value. When $X$ is a list of
attributes, we may write $t^D[X]$ to refer to the corresponding
list of attribute values. A tuple $t^D$ for a relation name
$R \in \mathcal{R}$ is called an $R$-tuple. We deal with queries $Q$ that are
expressed in relational algebra, and treat them as operators
that map an instance $D$ to an instance $Q(D)$.

For every attribute $A$ in the schema, we assume a binary
similarity relation $\approx_A \subseteq Dom_A \times Dom_A$. Notice that
whenever $A$ and $A'$ are comparable, the similarity relations $\approx_A$
and $\approx_{A'}$ are identical. We assume that each $\approx_A$ is
symmetric and subsumes equality, i.e., $=_{Dom_A} \subseteq \approx_A$. When
there is no confusion, we simply use $\approx$ for the similarity
relation. In particular, for lists of pairwise comparable
attributes, $X_i = A_i^1, \ldots, A_i^n$, $i = 1, 2$, we write $X_1 \approx X_2$ to
mean $A_1^1 \approx_1 A_2^1 \wedge \cdots \wedge A_1^n \approx_n A_2^n$, where $\approx_i$ is the similarity
relation applicable to attributes $A_1^i, A_2^i$.

Given two pairs of pairwise comparable attribute lists
$X_1, X_2$ and $Y_1, Y_2$ from relations $R_1, R_2$, resp., a matching
dependency (MD) [17] is a sentence of the form

$\varphi\colon R_1[X_1] \approx R_2[X_2] \rightarrow R_1[Y_1] \doteq R_2[Y_2]$.¹ (1)

This dependency intuitively states that if for an $R_1$-tuple
$t_1$ and an $R_2$-tuple $t_2$ in instance $D$, the attribute values
in $t_1^D[X_1]$ are similar to attribute values in $t_2^D[X_2]$, then we
need to make the values $t_1^D[Y_1]$ and $t_2^D[Y_2]$ pairwise identical.

Enforcing MDs may cause a database instance $D$ to be
changed to another instance $D'$. To keep track of every
single change, we assume that every tuple in an instance has
a unique identifier $t$, which could identify it in both instance
$D$ and its changed version $D'$. We use $t^D$ and $t^{D'}$ to refer to
a tuple and its changed version in $D'$ that has resulted from
enforcing an MD. For convenience, we may use the terms
tuple and tuple identifier interchangeably.

¹All the variables in $X_i, Y_j$ are implicitly universally
quantified in front of the formula.

Fan et al. [17] give a dynamic semantics for matching
dependencies in terms of a pair of instances; one where the
similarities hold, and a second where the specified identifi- 
cations have been enforced: 

Definition 1. [17] A pair of instances $(D, D')$ satisfies
the MD $\varphi\colon R_1[X_1] \approx R_2[X_2] \rightarrow R_1[Y_1] \doteq R_2[Y_2]$, denoted
$(D, D') \models \varphi$, if for every $R_1$-tuple $t_1$ and $R_2$-tuple $t_2$ in $D$
that match the left-hand side of $\varphi$, i.e., $t_1^D[X_1] \approx t_2^D[X_2]$,
the following holds in the instance $D'$:

(a) $t_1^{D'}[Y_1] = t_2^{D'}[Y_2]$, i.e., the values of the right-hand side
attributes of $\varphi$ have been identified in $D'$; and

(b) $t_1, t_2$ in $D'$ match the left-hand side of $\varphi$, that is,
$t_1^{D'}[X_1] \approx t_2^{D'}[X_2]$.

For a set $\Sigma$ of MDs, $(D, D') \models \Sigma$ iff $(D, D') \models \varphi$ for every
$\varphi \in \Sigma$. An instance $D'$ is called stable if $(D', D') \models \Sigma$. ■

Notice that a stable instance satisfies the MDs by itself, 
in the sense that all the required identifications are already 
enforced in it. So, whenever we say that an instance is dirty, 
we mean that it is not stable w.r.t. the given set of MDs. 

While this definition may be sufficient for the implication 
problem of MDs considered by Fan et al. [17], it does not 
specify how a dirty database should be updated to obtain 
a clean instance, especially when several interacting updates 
are required in order to enforce all the MDs. Thus, it does 
not give a complete prescription for the purpose of cleaning 
dirty instances. Moreover, from a different perspective, the 
requirements in the definition may be too strong, as the 
following example shows. 

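Example 2. Consider the set of MDs $\Sigma$ consisting of
$\varphi_1\colon R[A] \approx R[A] \rightarrow R[B] \doteq R[B]$ and
$\varphi_2\colon R[B, C] \approx R[B, C] \rightarrow R[D] \doteq R[D]$.
The similarities are: $a_1 \approx a_2$, $b_2 \approx b_3$, $c_2 \approx c_3$.
Instance $D_0$ below is not a stable instance, i.e., it
does not satisfy $\varphi_1, \varphi_2$. We start by enforcing $\varphi_1$ on $D_0$.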



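$D_0$:
A        B        C        D
$a_1$    $b_1$    $c_1$    $d_1$
$a_2$    $b_2$    $c_2$    $d_2$
$a_3$    $b_3$    $c_3$    $d_3$

$D_1$:
A        B                            C        D
$a_1$    $\langle b_1, b_2\rangle$    $c_1$    $d_1$
$a_2$    $\langle b_1, b_2\rangle$    $c_2$    $d_2$
$a_3$    $b_3$                        $c_3$    $d_3$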



Let $\langle b_1, b_2\rangle$ in instance $D_1$ denote the value that replaces
$b_1$ and $b_2$ to enforce $\varphi_1$ on instance $D_0$, and assume that
$\langle b_1, b_2\rangle \not\approx b_3$. Now, $(D_0, D_1) \models \varphi_1$. However,
$(D_0, D_1) \not\models \varphi_2$.

If we identify $d_2, d_3$ via $\langle d_2, d_3\rangle$ producing instance $D_2$,
the pair $(D_0, D_2)$ satisfies condition (a) in Definition 1
for $\varphi_2$, but not condition (b). Notice that making more
updates on $D_1$ (or $D_2$) to obtain an instance $D'$, such that
$(D_0, D') \models \Sigma$, seems hopeless as $\varphi_2$ will not be satisfied
because of the broken similarity that existed between $b_2$ and
$b_3$. ■
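The two conditions of Definition 1 can be checked mechanically on Example 2. The following is a minimal sketch, not part of the original formalism: it assumes a single relation, encodes instances as dictionaries keyed by tuple identifier, and represents the merged values by the placeholder strings `"<b1,b2>"` and `"<d2,d3>"`, which are assumed not to be similar to any other value. The names `similar`, `satisfies`, and `row` are hypothetical helpers.

```python
from itertools import product

# Declared similarities of Example 2; equal values are always similar.
SIM = {frozenset(p) for p in [("a1", "a2"), ("b2", "b3"), ("c2", "c3")]}

def similar(u, v):
    return u == v or frozenset((u, v)) in SIM

def satisfies(D, D_new, lhs, rhs):
    """Check (D, D_new) |= R[lhs] ~ R[lhs] -> R[rhs] = R[rhs] as in Definition 1."""
    for t1, t2 in product(D, repeat=2):
        if all(similar(D[t1][x], D[t2][x]) for x in lhs):
            # Condition (a): right-hand side values are identified in D_new.
            identified = all(D_new[t1][y] == D_new[t2][y] for y in rhs)
            # Condition (b): the left-hand side similarity still holds in D_new.
            preserved = all(similar(D_new[t1][x], D_new[t2][x]) for x in lhs)
            if not (identified and preserved):
                return False
    return True

def row(a, b, c, d):
    return {"A": a, "B": b, "C": c, "D": d}

D0 = {1: row("a1", "b1", "c1", "d1"),
      2: row("a2", "b2", "c2", "d2"),
      3: row("a3", "b3", "c3", "d3")}
D1 = {1: row("a1", "<b1,b2>", "c1", "d1"),
      2: row("a2", "<b1,b2>", "c2", "d2"),
      3: row("a3", "b3", "c3", "d3")}
D2 = {1: D1[1],
      2: row("a2", "<b1,b2>", "c2", "<d2,d3>"),
      3: row("a3", "b3", "c3", "<d2,d3>")}

phi1 = (["A"], ["B"])
phi2 = (["B", "C"], ["D"])

print(satisfies(D0, D1, *phi1))  # phi1 holds for (D0, D1)
print(satisfies(D0, D1, *phi2))  # phi2 fails: d2 and d3 are not identified
print(satisfies(D0, D2, *phi2))  # phi2 still fails: condition (b) is broken
```

Under these assumptions the check reproduces the observation of the example: identifying $d_2, d_3$ repairs condition (a) but not condition (b).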

Definition 1 seems to capture well the one-step enforcement
of a single MD. However, as shown by the above example,
the definition has to be refined in order to deal with sets of
interacting MDs and to capture an iterative process of MD
enforcement. We address this problem in Section 4.



Another issue worth mentioning is that stable instances
$D'$ for $D$ and $\Sigma$ are not subject to any form of minimality
criterion on $D'$ in relation with $D$. We would expect such
an instance to be obtained via the enforcement of the MDs,
without unnecessary changes. Unfortunately, this is not the
case here: If in Example 2 we keep only $\varphi_1$, and in instance
$D_1$ we change $a_3$ by an arbitrary value $a_4$ that is not similar
to either $a_1$ or $a_2$, we obtain a stable instance with origin in
$D_0$, but the change of $a_3$ is unjustified and unnecessary. We
will also address this issue.

Following [17], we assume in the rest of this paper that
each MD is of the form $R_1[X_1] \approx R_2[X_2] \rightarrow R_1[A_1] \doteq R_2[A_2]$.
That is, the right-hand side of the MDs contains a
pair of single attributes. We also assume that the sets $\Sigma$ of
MDs we consider are always finite.

3. MATCHING FUNCTIONS AND SEMANTIC DOMINATION

In order to enforce a set of MDs (cf. Section 3) we need
an operation that identifies two values whenever necessary.
With this purpose in mind, we will assume that for each
comparable pair of attributes $A_1, A_2$ with domain $Dom_A$, there
is a binary matching function $m_A\colon Dom_A \times Dom_A \rightarrow Dom_A$,
such that the value $m_A(a, a')$ is used to replace two values
$a, a' \in Dom_A$ whenever the two values need to be made
equal. Here are a few natural properties to expect of the
matching function $m_A$ (similar properties were considered in
0, cf. Section 7): For $a, a', a'' \in Dom_A$,

I (Idempotency): $m_A(a, a) = a$,

C (Commutativity): $m_A(a, a') = m_A(a', a)$,

A (Associativity): $m_A(a, m_A(a', a'')) = m_A(m_A(a, a'), a'')$.

It is reasonable to assume that any matching function
satisfies at least these three axioms. Under this assumption,
the structure $(Dom_A, m_A)$ forms a join semilattice, $L_A$, that
is, a partial order with a least upper bound (lub) for every
pair of elements. The induced partial order $\preceq_A$ on the
elements of $Dom_A$ is defined as follows: For every $a, a' \in Dom_A$,
$a \preceq_A a'$ whenever $m_A(a, a') = a'$. The lub operator with
respect to this partial order coincides with $m_A$:
$lub_{\preceq_A}\{a, a'\} = m_A(a, a')$.
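The axioms and the induced order can be tested by brute force on a small finite domain. The sketch below is only an illustration under stated assumptions: it picks set union on sets of address fragments as the matching function (one possible choice, not the only one), and the fragment list and helper names (`m`, `dominated_by`, `is_lub`) are hypothetical.

```python
from itertools import product

def m(a: frozenset, b: frozenset) -> frozenset:
    """Union matching function on sets of strings (an illustrative choice)."""
    return a | b

def dominated_by(a: frozenset, b: frozenset) -> bool:
    """a <=_A b holds exactly when m(a, b) == b."""
    return m(a, b) == b

# A small, hypothetical attribute domain: all subsets of three fragments.
fragments = ["25", "Main St.", "Ottawa"]
domain = [frozenset(f for f, keep in zip(fragments, bits) if keep)
          for bits in product([False, True], repeat=len(fragments))]

idempotent = all(m(a, a) == a for a in domain)
commutative = all(m(a, b) == m(b, a) for a, b in product(domain, repeat=2))
associative = all(m(a, m(b, c)) == m(m(a, b), c)
                  for a, b, c in product(domain, repeat=3))

def is_lub(a, b):
    """m(a, b) is an upper bound of a and b and lies below every other upper bound."""
    c = m(a, b)
    upper = dominated_by(a, c) and dominated_by(b, c)
    least = all(dominated_by(c, u) for u in domain
                if dominated_by(a, u) and dominated_by(b, u))
    return upper and least

lub_ok = all(is_lub(a, b) for a, b in product(domain, repeat=2))
print(idempotent, commutative, associative, lub_ok)
```

For this choice of $m_A$ the induced order is plain set inclusion, so the check confirms on the sample domain that $m_A$ computes the least upper bound.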

A natural interpretation for the partial order $\preceq_A$ in the
context of data cleaning would be the notion of semantic
domination. Basically, for two elements $a, a' \in Dom_A$, we
say that $a'$ semantically dominates $a$ if we have $a \preceq_A a'$.
In the process of cleaning the data by enforcing matching
dependencies, we always replace two values $a, a'$, whenever
certain similarities hold, by the value $m_A(a, a')$ that
semantically dominates both $a$ and $a'$. This notion of domination
is also related to relative information contents [11], [21], [22].

To define the semantics of query answering on instances
that have been cleaned with matching dependencies, we
might, in addition, need the existence of the greatest lower
bound (glb) for any two elements in the domain of an
attribute. We therefore assume that $(Dom_A, m_A)$ is a lattice
(i.e., both lub and glb exist for every pair of elements in
$Dom_A$ w.r.t. $\preceq_A$). Moreover, there is a special element
$\bot \in Dom_A$ such that $m_A(a, \bot) = a$, for every $a \in Dom_A$.
Notice that if we add an additional assumption to the
semilattice, which requires that for every element $a \in Dom_A$,
the set $\{c \in Dom_A \mid c \preceq_A a\}$ (the set of elements $c$ with
$m_A(a, c) = a$), is finite, then $glb_{\preceq_A}\{a, a'\}$ does exist for
every two elements $a, a' \in Dom_A$ and is equal to
$lub_{\preceq_A}\{c \in Dom_A \mid c \preceq_A a \text{ and } c \preceq_A a'\}$. We could also assume
the existence of another special element $\top \in Dom_A$ such
that $m_A(a, \top) = \top$, for every $a \in Dom_A$. This element
could represent the existence of inconsistency in data
whenever matching dependencies force to match two completely
unrelated elements $a, a'$ from the domain, in which case
$m_A(a, a') = \top$. However, the existence of $\top$ is not
essential in our framework.
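The construction of the glb from the matching function alone, as the lub of all common lower bounds, can be sketched for a finite domain as follows. This is an illustration under assumptions: the matching function is set union, the bottom element is the empty set, and the fragment names are hypothetical.

```python
from functools import reduce
from itertools import product

def m(a, b):
    """Union matching function (illustrative assumption)."""
    return a | b

def leq(a, b):
    """a <=_A b iff m(a, b) == b."""
    return m(a, b) == b

atoms = ["25", "Main St.", "Ottawa", "Vancouver"]
domain = [frozenset(x for x, keep in zip(atoms, bits) if keep)
          for bits in product([False, True], repeat=len(atoms))]
BOTTOM = frozenset()  # satisfies m(a, BOTTOM) == a for every a

def glb(a, b):
    """lub of all elements dominated by both a and b, computed with m only."""
    lower = [c for c in domain if leq(c, a) and leq(c, b)]
    return reduce(m, lower, BOTTOM)

ottawa = frozenset({"25", "Main St.", "Ottawa"})
vancouver = frozenset({"25", "Main St.", "Vancouver"})
print(sorted(glb(ottawa, vancouver)))
```

For the union matching function this derived glb coincides with set intersection, which the brute-force definition above does not use directly.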

Example 3. We give a few concrete examples of matching
functions for different attribute domains. Our example
functions have all the properties I, C, and A.

Name, Address, Phone: Each atomic value $s$ of these string
domains could be treated as a singleton set $\{s\}$. Then
a matching function $m(S_1, S_2)$ for sets of strings $S_i$
could return $S_1 \cup S_2$. E.g., when matching addresses,
m({'2366 Main Mall'}, {'Main Mall, Vancouver'}) could
return {'2366 Main Mall', 'Main Mall, Vancouver'}.
(This union matching function is further investigated
in Section 7.) Alternatively, a more sophisticated matching
function could merge two input strings into a third
string that contains both of the inputs. E.g., the match
of the two input strings above could instead be '2366
Main Mall, Vancouver'.

Age, Salary, Price: Each atomic value $a$ in these numerical
domains could be treated as an interval $[a, a]$. Then
the matching function $m([a_1, b_1], [a_2, b_2])$ would return
the smallest interval containing both $[a_1, b_1]$ and $[a_2, b_2]$,
i.e., $m([a_1, b_1], [a_2, b_2]) = [\min\{a_1, a_2\}, \max\{b_1, b_2\}]$.

Boolean Attributes: For attributes which take either the value
0 or 1, the matching function would return $m(0, 1) = \top$,
where $\top$ shows inconsistency in the data, and furthermore
$m(0, \top) = \top$ and $m(1, \top) = \top$. In this case,
the purpose of applying the matching function is to
record the inconsistency in the data and still conduct
sound reasoning in the presence of inconsistency.² ■
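The interval and boolean matching functions of Example 3 are simple enough to write down directly. The following sketch is an illustration only: intervals are encoded as pairs, and `TOP` is a hypothetical marker standing for the inconsistency element $\top$.

```python
from typing import Tuple

Interval = Tuple[float, float]

def m_interval(x: Interval, y: Interval) -> Interval:
    """Smallest interval containing both inputs."""
    return (min(x[0], y[0]), max(x[1], y[1]))

TOP = "T"  # hypothetical marker for the top (inconsistency) element

def m_bool(x, y):
    """Equal values match to themselves; any disagreement yields TOP."""
    return x if x == y else TOP

# An atomic age a is treated as the degenerate interval [a, a].
print(m_interval((30, 30), (32, 32)))
print(m_bool(0, 1), m_bool(0, TOP), m_bool(1, 1))
```

Both functions are idempotent, commutative, and associative, which can be confirmed by enumerating small sets of inputs.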

An additional property of matching functions worthy of
consideration is similarity preservation, that is, the result of
applying a matching function preserves the similarity that
existed between the old value being replaced and other values
in the domain. More formally:

S (Similarity Preservation): If $a \approx a'$, then $a \approx m_A(a', a'')$,
for every $a, a', a'' \in Dom_A$.

Unlike the previous properties (I, C, A), property S turns
out to be a strong assumption, and we must consider both
matching functions with S and without it. Indeed, notice
that since $\approx$ subsumes equality, and $m_A$ is commutative,
assumption S implies $a \approx m_A(a, a')$ and $a' \approx m_A(a, a')$, i.e.,
similarity preserving matching always results in a value
similar to the value being replaced. In the rest of the paper, we
assume that for every comparable pair of attributes $A_1, A_2$,
there is an idempotent, commutative, and associative binary
matching function $m_A$. Unless otherwise specified, we do not
assume that these functions are similarity preserving.
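Whether a given matching function is similarity preserving depends on the similarity relation it is paired with. The sketch below, an illustration under assumptions, tests property S by brute force for the union matching function against two hypothetical similarity relations on non-empty sets: `overlap` (the sets share an element) and `close` (the sets differ in at most one element). Both relations are symmetric and subsume equality on this domain.

```python
from itertools import combinations, product

def m(a, b):
    """Union matching function (illustrative assumption)."""
    return a | b

def preserves_similarity(domain, similar):
    """Property S: a ~ b implies a ~ m(b, c) for all a, b, c in the domain."""
    return all(similar(a, m(b, c))
               for a, b, c in product(domain, repeat=3) if similar(a, b))

items = ["x", "y", "z"]
# All non-empty subsets of the three items.
domain = [frozenset(c) for r in range(1, 4) for c in combinations(items, r)]

def overlap(a, b):
    return bool(a & b)

def close(a, b):
    return len(a ^ b) <= 1

print(preserves_similarity(domain, overlap))
print(preserves_similarity(domain, close))
```

With `overlap`, union never destroys a shared element, so S holds. With `close`, matching {x} with {y, z} yields {x, y, z}, which is no longer close to {x}, so S fails for the same matching function.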



²Matching of boolean attributes requires the existence of the
top element $\top$.



Definition 2. Let $D_1, D_2$ be instances of schema $\mathcal{R}$, and
$t_1, t_2$ be two $R$-tuples in $D_1, D_2$, respectively, with $R \in \mathcal{R}$.
We write $t_1^{D_1} \preceq t_2^{D_2}$ if $t_1^{D_1}[A] \preceq_A t_2^{D_2}[A]$ for every attribute
$A$ in $R$. We write $D_1 \sqsubseteq D_2$ if for every tuple $t_1$ in $D_1$, there
is a tuple $t_2$ in $D_2$, such that $t_1^{D_1} \preceq t_2^{D_2}$. ■

Clearly, the relation $\preceq$ on tuples can be applied to tuples in
the same instance. The ordering $\sqsubseteq$ on sets has been used
in the context of complex objects [Bl [21] and also powerdomains,
where it is called Hoare ordering [11]. It is also used
in [H] for entity resolution (cf. Section 7). It is known that
for $\sqsubseteq$ to be a partial order, specifically to be antisymmetric,
we need to deal with reduced instances [^, i.e., instances $D$
in which there are no two different tuples $t_1, t_2$, such that
$t_1^D \preceq t_2^D$. We can obtain the reduced version of every
instance $D$ by removing every tuple $t_1$, such that $t_1^D \prec t_2^D$ for
some tuple $t_2$ in $D$.
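Definition 2 and the notion of a reduced instance translate directly into code once a value lattice is fixed. The sketch below assumes the union matching function, so that value domination is set inclusion; tuples are encoded as Python tuples of frozensets and instances as sets of tuples of a single relation. The instances reuse the data of Example 1, with addresses kept as sets of strings; all helper names are hypothetical.

```python
def tuple_leq(t, u):
    """t is dominated by u attribute-wise (set inclusion under union matching)."""
    return all(a <= b for a, b in zip(t, u))

def instance_leq(D1, D2):
    """D1 is dominated by D2: every tuple of D1 is dominated by some tuple of D2."""
    return all(any(tuple_leq(t, u) for u in D2) for t in D1)

def reduced(D):
    """Drop every tuple that is strictly dominated by another tuple of D."""
    return {t for t in D if not any(t != u and tuple_leq(t, u) for u in D)}

def s(*xs):
    return frozenset(xs)

# (name, address) tuples before and after enforcing the MD of Example 1.
D0 = {(s("John Doe"), s("Main St., Ottawa")), (s("J. Doe"), s("25 Main St."))}
both = s("Main St., Ottawa", "25 Main St.")
D1 = {(s("John Doe"), both), (s("J. Doe"), both)}

print(instance_leq(D0, D1), instance_leq(D1, D0))
print(reduced(D0 | D1) == D1)
```

The cleaned instance dominates the dirty one but not conversely, and reducing the union of both keeps only the more informative tuples.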

Next we will show that the set of reduced instances with 
the partial order $\sqsubseteq$ forms a lattice: the least upper bound
and the greatest lower bound for every finite set of reduced 
instances exist. This result will be used later for query an- 
swering. We adapt some of the results from [6], where they 
prove a similar result for a lattice formed by the set of com- 
plex objects and the sub-object partial order. 

Definition 3. Let $D_1, D_2$ be instances of schema $\mathcal{R}$, and
$t_1, t_2$ be two $R$-tuples in $D_1, D_2$, respectively, for $R \in \mathcal{R}$.

(a) We define $D_1 \curlyvee D_2$ to be the reduced version of $D_1 \cup D_2$,
where $D_1 \cup D_2$ refers to the instance that takes the
union of $R$-tuples from $D_1$ and $D_2$ for every $R \in \mathcal{R}$.

(b) We define $t_1 \curlywedge t_2$ to be the tuple $t$, such that $t[A] =
glb_{\preceq_A}\{t_1^{D_1}[A], t_2^{D_2}[A]\}$ for every attribute $A$ in $R$.

(c) We define $D_1 \curlywedge D_2$ to be the reduced version of the
instance that assigns the set of tuples $\{t_1 \curlywedge t_2 \mid t_1 \in
D_1, t_2 \in D_2, t_1, t_2 \text{ are } R\text{-tuples}\}$ to every $R \in \mathcal{R}$. ■

Next we show that the operations defined in Definition 3
coincide with the least upper bound and greatest lower
bound of instances w.r.t. the partial order $\sqsubseteq$.

Lemma 1. For every two instances $D_1, D_2$ and $R$-tuples
$t_1, t_2$ in $D_1, D_2$, the following holds:

1. $D_1 \curlyvee D_2$ is the least upper bound of $D_1, D_2$ w.r.t. $\sqsubseteq$.

2. $t_1 \curlywedge t_2$ is the greatest lower bound of $t_1, t_2$ w.r.t. $\preceq$.

3. $D_1 \curlywedge D_2$ is the greatest lower bound of $D_1, D_2$ w.r.t. $\sqsubseteq$.

In particular, we can see that $\preceq$ imposes a lattice structure
on $R$-tuples. Using Lemma 1 we immediately obtain the
following result.

Theorem 1. The set of reduced instances for a given
schema with the $\sqsubseteq$ ordering forms a lattice. ■
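The operations of Definition 3 can be sketched for a single relation as follows. As before, this is an illustration under assumptions: values are frozensets, the matching function is union (so the attribute-level glb is intersection), and instances are sets of tuples. The names `lub`, `glb`, and the sample instances are hypothetical.

```python
def tuple_leq(t, u):
    return all(a <= b for a, b in zip(t, u))

def instance_leq(D1, D2):
    return all(any(tuple_leq(t, u) for u in D2) for t in D1)

def reduced(D):
    return {t for t in D if not any(t != u and tuple_leq(t, u) for u in D)}

def lub(D1, D2):
    """Reduced version of the union of the two instances."""
    return reduced(D1 | D2)

def glb_tuple(t, u):
    """Attribute-wise glb; under union matching this is intersection."""
    return tuple(a & b for a, b in zip(t, u))

def glb(D1, D2):
    """Reduced version of all pairwise tuple glbs."""
    return reduced({glb_tuple(t, u) for t in D1 for u in D2})

def s(*xs):
    return frozenset(xs)

# Two reduced single-attribute instances.
D1 = {(s("a", "b"),), (s("c"),)}
D2 = {(s("a"),), (s("b", "c"),)}

up, low = lub(D1, D2), glb(D1, D2)
print(up == {(s("a", "b"),), (s("b", "c"),)})
print(low == {(s("a"),), (s("b"),), (s("c"),)})
```

On this sample, both inputs are dominated by the computed lub and dominate the computed glb, as Lemma 1 states in general.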

4. ENFORCEMENT OF MDS AND CLEAN INSTANCES

In this section, we define clean instances that can be
obtained from a dirty instance by iteratively enforcing a set of
MDs in a chase-like procedure. Let $D, D'$ be two database
instances with the same set of tuple identifiers, and $t_1, t_2$ be
an $R_1$-tuple and an $R_2$-tuple, respectively, in both $D$ and
$D'$. Let $\Sigma$ be a set of MDs, and
$\varphi\colon R_1[X_1] \approx R_2[X_2] \rightarrow R_1[A_1] \doteq R_2[A_2]$ be an MD in $\Sigma$.

Definition 4. Instance $D'$ is the immediate result of
enforcing $\varphi$ on $t_1, t_2$ in instance $D$, denoted by
$(D, D')_{[t_1, t_2]} \models \varphi$, if

1. $t_1^D[X_1] \approx t_2^D[X_2]$, but $t_1^D[A_1] \neq t_2^D[A_2]$;

2. $t_1^{D'}[A_1] = t_2^{D'}[A_2] = m_A(t_1^D[A_1], t_2^D[A_2])$; and

3. $D, D'$ agree on every other tuple and attribute value.



Definition 4 captures a single step in a chase-like procedure
that starts from a dirty instance $D_0$ and enforces MDs step
by step, by applying matching functions, until the instance
becomes stable. We propose that the output of this chase
should be treated as a clean version of the original instance,
given a set of MDs, formally defined as follows.

Definition 5. For an instance $D_0$ and a set of MDs $\Sigma$,
an instance $D_k$ is $(D_0, \Sigma)$-clean if $D_k$ is stable, and there
exists a finite sequence of instances $D_1, \ldots, D_{k-1}$ such that,
for every $i \in [1, k]$, $(D_{i-1}, D_i)_{[t_1, t_2]} \models \varphi$, for some $\varphi \in \Sigma$
and tuple identifiers $t_1, t_2$. ■

Notice that if $(D_0, D_0) \models \Sigma$, i.e., it is already stable, then $D_0$
is its only $(D_0, \Sigma)$-clean instance. Moreover, we have $D_{i-1} \sqsubseteq
D_i$, for every $i \in [1, k]$, since we are using matching functions
to identify values. In particular, we have $D_0 \sqsubseteq D_k$. In
other words, clean instance $D_k$ semantically dominates dirty
instance $D_0$, and we might say that $D_k$ is more informative
than $D_0$.

Theorem 2. Let $\Sigma$ be a set of matching dependencies
and $D_0$ be an instance. Then every sequence $D_1, D_2, \ldots$
such that $(D_{i-1}, D_i)_{[t_1, t_2]} \models \varphi$, for some $\varphi \in \Sigma$ and tuple
identifiers $t_1, t_2$ in $D_{i-1}$, is finite and computes a $(D_0, \Sigma)$-clean
instance $D_k$ in a polynomial number of steps in the size
of $D_0$. ■

In other words, the sequence of instances obtained by chasing
MDs reaches a fixpoint after a polynomial number of steps.
That is, it is not possible to generate a new instance, because
condition 1 in Definition 4 is not satisfied by the last
generated instance, i.e., the last instance is stable w.r.t. all MDs.
This is a consequence of assuming that matching functions
are idempotent, commutative, and associative.

Observe that, for a given instance $D_0$ and set of MDs $\Sigma$,
multiple clean instances may exist, each resulting from a
different order of application of MDs on $D_0$ and from different
selections of violating tuples. It is easy to show that the
number of clean instances is finite.

Notice also that for a $(D_0, \Sigma)$-clean instance $D_k$, we may
have $(D_0, D_k) \not\models \Sigma$ (cf. Definition 1). Intuitively, the reason
is that some of the similarities that existed in $D_0$ could have
been broken by iteratively enforcing the MDs to produce
$D_k$. We argue that this is a price we may have to pay if
we want to enforce a set of interacting MDs. However, each
$(D_0, \Sigma)$-clean instance is stable and captures the persistence
of attribute values that are not affected by MDs. The following
example illustrates these points. For simplicity, we write
$\langle a_1, \ldots, a_l\rangle$ to represent $m_A(a_1, m_A(a_2, m_A(\ldots, a_l)))$, which is
allowed by the associativity assumption.



Example 4. Consider the set of MDs $\Sigma$ consisting of
$\varphi_1\colon R[A] \approx R[A] \rightarrow R[B] \doteq R[B]$ and
$\varphi_2\colon R[B] \approx R[B] \rightarrow R[C] \doteq R[C]$.
We have the similarities: $a_1 \approx a_2$, $b_2 \approx b_3$.
The following sequence of instances leads to a $(D_0, \Sigma)$-clean
instance $D_2$.

$D_2$:
A        B                            C
$a_1$    $\langle b_1, b_2\rangle$    $\langle c_1, c_2\rangle$
$a_2$    $\langle b_1, b_2\rangle$    $\langle c_1, c_2\rangle$
$a_3$    $b_3$                        $c_3$

However, $(D_0, D_2) \not\models \Sigma$, and the reason is that $\langle b_1, b_2\rangle \approx b_3$
does not necessarily hold. We can enforce the MDs in another
order and obtain a different $(D_0, \Sigma)$-clean instance:

Again, $D_3$ is a $(D_0, \Sigma)$-clean instance, but $(D_0, D_3) \not\models \Sigma$.
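The chase of Definitions 4 and 5 can be explored exhaustively on Example 4. The following sketch is an illustration under assumptions, not the procedure as the authors implement it: values are frozensets of atoms, the matching function is union, a merged (non-singleton) value is assumed to be similar only to itself, and the search enumerates every order of enforcement. All identifiers are hypothetical.

```python
from itertools import combinations

ATTRS = ("A", "B", "C")
MDS = [(("A",), "B"), (("B",), "C")]  # (left-hand attributes, right-hand attribute)
SIM = {frozenset({"a1", "a2"}), frozenset({"b2", "b3"})}

def s(*xs):
    return frozenset(xs)

def similar(u, v):
    """Equal values are similar; otherwise only declared pairs of atomic values."""
    return u == v or (len(u) == 1 and len(v) == 1 and (u | v) in SIM)

def violations(D):
    """Yield (i, j, rhs) for tuple pairs meeting condition 1 of Definition 4."""
    for lhs, rhs in MDS:
        for i, j in combinations(range(len(D)), 2):
            if all(similar(D[i][x], D[j][x]) for x in lhs) and D[i][rhs] != D[j][rhs]:
                yield i, j, rhs

def enforce(D, i, j, rhs):
    """One chase step: replace both right-hand side values by their match."""
    new = [dict(t) for t in D]
    new[i][rhs] = new[j][rhs] = D[i][rhs] | D[j][rhs]
    return new

def freeze(D):
    return tuple(tuple(t[a] for a in ATTRS) for t in D)

def clean_instances(D0):
    """Depth-first search over all enforcement orders; collect the stable instances."""
    seen, stack, clean = set(), [D0], set()
    while stack:
        D = stack.pop()
        key = freeze(D)
        if key in seen:
            continue
        seen.add(key)
        steps = list(violations(D))
        if not steps:
            clean.add(key)
        stack.extend(enforce(D, i, j, rhs) for i, j, rhs in steps)
    return clean

D0 = [dict(zip(ATTRS, (s(f"a{k}"), s(f"b{k}"), s(f"c{k}")))) for k in (1, 2, 3)]
result = clean_instances(D0)
print(len(result))
```

Under these assumptions the search finds exactly two clean instances, matching $D_2$ and $D_3$ of the example: one where the $C$-values of the first two tuples become $\langle c_1, c_2\rangle$, and one where they become $\langle c_1, c_2, c_3\rangle$ while the third tuple receives $\langle c_2, c_3\rangle$.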



It would be interesting to know when there is only one
$(D_0, \Sigma)$-clean instance $D_k$, and also when, for a clean
instance $D_k$, $(D_0, D_k) \models \Sigma$ holds. The following two
propositions establish natural sufficient conditions for these
properties to hold.

Proposition 1. Suppose that for every pair of comparable
attributes $A_1, A_2$, the matching function $m_A$ is similarity
preserving. Then, for every set of MDs $\Sigma$ and every instance
$D_0$, there is a unique $(D_0, \Sigma)$-clean instance $D_k$.
Furthermore, $(D_0, D_k) \models \Sigma$. ■

We say that a set of matching dependencies $\Sigma$ is
interaction-free if for every two MDs $\varphi_1, \varphi_2 \in \Sigma$, the two sets of
attributes on the right-hand side of $\varphi_1$ and left-hand side of
$\varphi_2$ are disjoint. The two sets of MDs in Examples [5] and U
are not interaction-free.

Proposition 2. Let $\Sigma$ be an interaction-free set of MDs.
Then for every instance $D_0$, there is a unique $(D_0, \Sigma)$-clean
instance $D_k$. Furthermore, $(D_0, D_k) \models \Sigma$. ■
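Interaction-freeness is a purely syntactic condition and can be tested directly on the attribute sets. The sketch below encodes each MD as a hypothetical pair (left-hand attributes, right-hand attributes) over a single relation and, as an assumption about how to read "every two MDs", also compares each MD with itself.

```python
def interaction_free(mds):
    """No right-hand side attribute of any MD occurs on the left-hand side of any MD."""
    return all(not (set(rhs1) & set(lhs2))
               for _, rhs1 in mds
               for lhs2, _ in mds)

# MD sets of Example 2 and Example 4, plus a hypothetical interaction-free set.
example2 = [(("A",), ("B",)), (("B", "C"), ("D",))]
example4 = [(("A",), ("B",)), (("B",), ("C",))]
free_set = [(("A",), ("B",)), (("A",), ("C",))]

print(interaction_free(example2), interaction_free(example4), interaction_free(free_set))
```

In Examples 2 and 4 attribute $B$ is written by $\varphi_1$ and read by $\varphi_2$, so neither set is interaction-free, while the third set only reads $A$ and never writes it.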

The chase-like procedure that produces a $(D_0, \Sigma)$-clean
instance makes only those changes to instance $D_0$ that are
necessary, and are imposed by the dynamic semantics of
MDs. In this sense, we can say that the chase implements
minimal changes necessary to obtain a clean instance.

Another interesting question is whether $(D_0, \Sigma)$-clean
instances are at a minimal distance to $D_0$ w.r.t. the partial
order $\sqsubseteq$. This is not true in general. For instance, in
Example 4 observe that for the two $(D_0, \Sigma)$-clean instances $D_2$
and $D_3$, $D_2 \sqsubseteq D_3$, but $D_3 \not\sqsubseteq D_2$, which means $D_3$ is not at
a minimal distance to $D_0$ w.r.t. $\sqsubseteq$. However, both of these
clean instances may be useful in query answering, because,
informally speaking, they can provide a lower bound and
an upper bound for the possible clean interpretations of the
original dirty instance w.r.t. the semantic domination. This
issue is discussed in the next section.

5. CLEAN QUERY ANSWERING 

Most of the literature on data cleaning has concentrated 
on producing a clean instance starting from a dirty one. 
However, the problem of characterizing and retrieving the 
data in the original instance that can be considered to be 
clean has been neglected. In this section we study this
problem, focusing on query answering. More precisely, given an
instance $D$, a set $\Sigma$ of MDs, and a query $Q$ posed to $D$, we
want to characterize the answers that are consistent with
$\Sigma$, i.e., that would be returned by an instance where all the
MDs have been enforced. Of course, we have to take into
account that there may be several such instances.

This situation is similar to the one encountered in consistent
query answering (CQA) [3], [9], [13], where query answering
is characterized and performed on database instances
that may fail to satisfy certain classic integrity constraints 
(ICs). For such a database instance, a repair is an instance 
that satisfies the integrity constraints and minimally differs 
from the original instance. For a given query, a consistent 
answer (a form of certain answer) is defined as the set of 
tuples that are present in the intersection of answers to the 
query when posed to every repair. A less popular alternative 
is the notion of possible answer, that is defined as the union 
of all tuples that are present in the answer to the query when 
posed to every repair. 

A similar semantics for clean query answering under matching
dependencies can be defined. However, the partial order
relationship $\sqsubseteq$ between a dirty instance and its clean
instances establishes an important difference between clean
instances w.r.t. matching dependencies and repairs w.r.t. 
traditional ICs. 

Intuitively, a clean instance has improved the information 
that already existed in the dirty instance and made it more 
informative and consistent. We would like to carefully take 
advantage of this partial order relationship and use it in the 
definition of certain and possible answers. We do this by 
taking the greatest lower bound (glb) and least upper bound
(lub) of answers of the query over multiple clean instances, 
instead of taking the set-theoretic intersection and union. 

Let $\Sigma$ be a set of MDs, $D_0$ be a database instance, and
$Q$ be a query posed to instance $D_0$. We define certain and
possible answers as follows.

$Cert_Q(D_0) = glb_{\sqsubseteq}\{Q(D) \mid D \text{ is a } (D_0, \Sigma)\text{-clean instance}\}$. (2)
$Poss_Q(D_0) = lub_{\sqsubseteq}\{Q(D) \mid D \text{ is a } (D_0, \Sigma)\text{-clean instance}\}$. (3)

The glb and lub above are defined on the basis of the partial
order $\sqsubseteq$ on sets of tuples. Since there is a finite number of
clean instances for $D_0$, these glb and lub exist (cf. Theorem 1).
In Eq. (2) and (3) we are assuming that each of the
$Q(D)$ is reduced (cf. Section 3). By Definition 3, $Cert_Q(D_0)$
and $Poss_Q(D_0)$ are also reduced. Moreover, we clearly have
$Cert_Q(D_0) \sqsubseteq Poss_Q(D_0)$.
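Equations (2) and (3) can be evaluated directly once the clean instances are available. The sketch below is an illustration under assumptions: it hard-codes the two clean instances $D_2$ and $D_3$ of Example 4, models values as frozensets with union as the matching function (so the attribute-level glb is intersection), and uses a hypothetical query that projects attribute $C$ for the tuple whose $A$-value is $a_1$.

```python
from functools import reduce

def tuple_leq(t, u):
    return all(a <= b for a, b in zip(t, u))

def instance_leq(D1, D2):
    return all(any(tuple_leq(t, u) for u in D2) for t in D1)

def reduced(D):
    return {t for t in D if not any(t != u and tuple_leq(t, u) for u in D)}

def lub(D1, D2):
    return reduced(D1 | D2)

def glb(D1, D2):
    return reduced({tuple(a & b for a, b in zip(t, u)) for t in D1 for u in D2})

def s(*xs):
    return frozenset(xs)

# The two (D0, Sigma)-clean instances of Example 4, as (A, B, C) tuples.
D2 = {(s("a1"), s("b1", "b2"), s("c1", "c2")),
      (s("a2"), s("b1", "b2"), s("c1", "c2")),
      (s("a3"), s("b3"), s("c3"))}
D3 = {(s("a1"), s("b1", "b2"), s("c1", "c2", "c3")),
      (s("a2"), s("b1", "b2"), s("c1", "c2", "c3")),
      (s("a3"), s("b3"), s("c2", "c3"))}

def query(D):
    """Hypothetical query: project C from the tuples whose A-value is a1."""
    return reduced({(t[2],) for t in D if t[0] == s("a1")})

answers = [query(D) for D in (D2, D3)]
cert = reduce(glb, answers)   # Eq. (2): glb over all clean instances
poss = reduce(lub, answers)   # Eq. (3): lub over all clean instances
print(cert, poss)
```

Here the certain answer is the $C$-value $\langle c_1, c_2\rangle$ common to both clean instances, the possible answer is $\langle c_1, c_2, c_3\rangle$, and the certain answer is dominated by the possible one.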

The following example motivates these choices. It also 
shows that, unlike some cases of inconsistent databases and 
consistent query answering, certain answers could be quite 
informative and meaningful for databases with matching de- 
pendencies. 



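Example 5. Consider relation $R(\mathit{name}, \mathit{phone}, \mathit{address})$,
and set $\Sigma$ consisting of the following MDs:

$\varphi_1 : R[\mathit{name}, \mathit{phone}, \mathit{address}] \approx R[\mathit{name}, \mathit{phone}, \mathit{address}] \rightarrow R[\mathit{address}] \doteq R[\mathit{address}]$,
$\varphi_2 : R[\mathit{phone}, \mathit{address}] \approx R[\mathit{phone}, \mathit{address}] \rightarrow R[\mathit{phone}] \doteq R[\mathit{phone}]$.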

Suppose that in the dirty instance $D_0$, shown below, the
following similarities hold:

"John Doe" $\approx$ "J. Doe",
"Jane Doe" $\approx$ "J. Doe",
"(613)123 4567" $\approx$ "123 4567",
"(604)123 4567" $\approx$ "123 4567",
"25 Main St." $\approx$ "Main St., Ottawa",
"25 Main St." $\approx$ "25 Main St., Vancouver".

Other non-trivial similarities that are not mentioned do not
hold. Moreover, the matching functions act as follows:

$m_{\mathit{phone}}$("(613)123 4567", "123 4567") = "(613)123 4567",
$m_{\mathit{phone}}$("123 4567", "(604)123 4567") = "(604)123 4567",
$m_{\mathit{address}}$("Main St., Ottawa", "25 Main St.") = "25 Main St., Ottawa",
$m_{\mathit{address}}$("25 Main St.", "25 Main St., Vancouver") = "25 Main St., Vancouver".

Observe that from $D_0$ we can obtain two different $(D_0,\Sigma)$-clean
instances $D, D'$, depending on the order of enforcing MDs.

[Tables: the instance $D_0$ and the clean instances $D$ and $D'$.]



Now consider the query $Q : \pi_{\mathit{address}}(\sigma_{\mathit{name}=\text{"J. Doe"}}(R))$, asking
for the residential address of J. Doe. We are interested in
a certain answer. It can be obtained by taking the greatest
lower bound of the two answer sets:

$Q(D) = \{(\text{"25 Main St., Ottawa"})\}$,
$Q(D') = \{(\text{"25 Main St., Vancouver"})\}$.

In this case, according to [6] and using Lemma 1,

$\mathrm{glb}_{\sqsubseteq}\{Q(D), Q(D')\} = \{a \curlywedge a' \mid a \in Q(D),\ a' \in Q(D')\}$
$= \{(\text{"25 Main St., Ottawa"}) \curlywedge (\text{"25 Main St., Vancouver"})\}$
$= \{(\mathrm{glb}_{\preceq_{\mathit{address}}}\{\text{"25 Main St., Ottawa"}, \text{"25 Main St., Vancouver"}\})\}$
$= \{(\text{"25 Main St."})\}$.

We can see that, no matter how we clean $D_0$, we can say
for sure that J. Doe is at 25 Main St. Notice that the
set-theoretic intersection of the two answer sets is empty. If we
were interested in all possible answers, we could take the
least upper bound of the two answer sets, which would be the
union of the two in this case. ■
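The computation in Example 5 can be sketched in code. The sketch below is only an illustration under a strong simplifying assumption that is not part of the example: each attribute value is modelled as a set of atomic facts, so that domination is set inclusion, the glb of two values is their intersection, and the lub is their union. The helper names (`dominated`, `reduce_set`, `meet`, `glb`, `lub`, `addr`) are hypothetical.

```python
from itertools import product

# Assumption: every attribute value is a frozenset of atomic facts, so
# a ≼ b is subset inclusion and the glb of two values is their intersection.

def dominated(t, u):
    """t ≼ u: each component of t is dominated by the same component of u."""
    return len(t) == len(u) and all(a <= b for a, b in zip(t, u))

def reduce_set(tuples):
    """Keep only tuples that are not strictly dominated by another tuple."""
    ts = set(tuples)
    return {t for t in ts if not any(t != u and dominated(t, u) for u in ts)}

def meet(t, u):
    """Component-wise glb of two tuples."""
    return tuple(a & b for a, b in zip(t, u))

def glb(s1, s2):
    """glb of two answer sets: all pairwise meets, then reduced."""
    return reduce_set(meet(t, u) for t, u in product(s1, s2))

def lub(s1, s2):
    """lub of two answer sets: their union, then reduced."""
    return reduce_set(set(s1) | set(s2))

def addr(*facts):
    """Build a one-attribute answer tuple from atomic address facts."""
    return (frozenset(facts),)

q_d = {addr("25", "Main St.", "Ottawa")}           # Q(D)
q_d_prime = {addr("25", "Main St.", "Vancouver")}  # Q(D')

certain = glb(q_d, q_d_prime)    # the tuple for "25 Main St."
possible = lub(q_d, q_d_prime)   # both addresses, neither dominates the other
```

Under this modelling the set-theoretic intersection `q_d & q_d_prime` is empty, while `certain` still carries the common part of the two addresses.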

We define a clean answer to be a pair formed by a lower and an upper
bound of query answers over all possible clean interpretations of a
dirty database instance. This definition is inspired by the
same kind of approximations used in the contexts of partial 
and incomplete information |29|[T|. inconsistent databases [3l 
ini [13], and data exchange [2H]. These upper and lower 
bounds could provide useful information about the value of 
aggregate functions, such as sum and count [5] 1181 [2]. 

Definition 6. For a query $Q$ posed to a database instance
$D_0$ and a set of MDs $\Sigma$, a clean answer is specified
by two bounds as

$\mathit{Clean}_Q(D_0) = (\mathit{Cert}_Q(D_0), \mathit{Poss}_Q(D_0))$. ■

Notice, from the results in Section 4, that in the case of having
similarity-preserving matching functions or non-interacting
matching dependencies, these bounds collapse into a
single set, which is obtained by running the query on the
unique clean instance.

Complexity of Computing Clean Answers 

Here we study the complexity of computing clean answers 
over database instances in presence of MDs. As with in- 
complete and inconsistent databases, this problem easily be- 
comes intractable for simple MDs and queries, which moti- 
vates the need for developing approximate solutions to these 
problems. We explore approximate solutions for queries that
behave monotonically w.r.t. the partial order $\sqsubseteq$ in Section 6.

Theorem 3. There are a schema with two interacting
MDs and a relational algebra query for which deciding
whether a tuple belongs to the certain answer set for an
instance $D_0$ is coNP-complete (in the size of $D_0$). ■

6. MONOTONE QUERIES 

So far we have seen that clean instances are a more informative
view of a dirty instance obtained by enforcing matching
dependencies. That is, $D_0 \sqsubseteq D$ for every $(D_0,\Sigma)$-clean
instance $D$. From this perspective, it would be natural to
expect that for a positive query, we would obtain a more
informative answer if we pose it to a clean instance instead
of to the dirty one. We can translate this requirement into
a monotonicity property for queries w.r.t. the partial order
$\sqsubseteq$.

Definition 7. A query $Q$ is $\sqsubseteq$-monotone if, for every
pair of instances $D, D'$ such that $D \sqsubseteq D'$, we have
$Q(D) \sqsubseteq Q(D')$. ■

Monotone queries have an interesting behavior when com- 
puting clean answers. For these queries, we can under- 
approximate (over-approximate) certain answers (possible 
answers) by taking the greatest lower bound (least upper 
bound) of all clean instances and then running the query on 
the result. Notice that we are not claiming that these are 
polynomial-time approximations. 



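Proposition 3. If $\mathcal{D}$ is a finite set of database instances
and $Q$ is a $\sqsubseteq$-monotone query, the following holds:

$Q(\mathrm{glb}_{\sqsubseteq}\{D \mid D \in \mathcal{D}\}) \sqsubseteq \mathrm{glb}_{\sqsubseteq}\{Q(D) \mid D \in \mathcal{D}\}$, (4)

$\mathrm{lub}_{\sqsubseteq}\{Q(D) \mid D \in \mathcal{D}\} \sqsubseteq Q(\mathrm{lub}_{\sqsubseteq}\{D \mid D \in \mathcal{D}\})$. (5)

■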
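The reasoning behind (4) is short enough to spell out. The following is a sketch, not the paper's own proof, and it assumes that the glb and lub involved exist, as they do for finite sets of reduced instances; the name $G$ is introduced here only for the derivation.

```latex
\begin{align*}
&\text{Let } G = \mathrm{glb}_{\sqsubseteq}\{D \mid D \in \mathcal{D}\}. \\
&G \sqsubseteq D \ \text{ for every } D \in \mathcal{D}
   && \text{($G$ is a lower bound of $\mathcal{D}$)} \\
&Q(G) \sqsubseteq Q(D) \ \text{ for every } D \in \mathcal{D}
   && \text{($Q$ is $\sqsubseteq$-monotone)} \\
&Q(G) \sqsubseteq \mathrm{glb}_{\sqsubseteq}\{Q(D) \mid D \in \mathcal{D}\}
   && \text{(the glb is the greatest lower bound)}
\end{align*}
```

Inequality (5) follows by the dual argument: every $D \in \mathcal{D}$ is dominated by the lub of $\mathcal{D}$, so each $Q(D)$ is dominated by $Q$ applied to that lub, which is therefore an upper bound of all the $Q(D)$.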

As is well known, positive relational algebra queries composed
of selection, projection, Cartesian product, and union
are monotone w.r.t. set inclusion. However, the following example
shows that $\sqsubseteq$-monotonicity does not hold even for very simple
positive queries involving selections.

Example 6. Consider instance $D_0$ in Example 5 and the two
$(D_0,\Sigma)$-clean instances $D$ and $D'$. Let $Q$ be a query asking
for names of people residing at "25 Main St.", expressed as
relational algebra expression $\pi_{\mathit{name}}(\sigma_{\mathit{address}=\text{"25 Main St."}}(R))$.
Observe that $Q(D_0) = \{(\text{"J. Doe"})\}$, and $Q(D) = Q(D') = \emptyset$.
Query $Q$ is therefore not monotone, because we have
$D_0 \sqsubseteq D$, $D_0 \sqsubseteq D'$, but $Q(D_0) \not\sqsubseteq Q(D)$, $Q(D_0) \not\sqsubseteq Q(D')$.

■

It is not surprising that $\sqsubseteq$-monotonicity is not captured by
usual relational queries, in particular, by queries that are
monotone w.r.t. set inclusion. After all, the queries we have
considered so far do not even mention the $\preceq$-lattice that is
at the basis of the $\sqsubseteq$ order. Next we will consider queries
expressing conditions in terms of the semantic domination
lattice.

6.1 Query relaxation 

As shown in Example 6, we may not get the answer we
expect by running a usual relational algebra query on an
instance that has been cleaned using matching dependencies.
We therefore propose to relax the queries, by taking
advantage of the underlying $\preceq$-lattice structure obtained from
matching functions, to make them $\sqsubseteq$-monotone. In this way,
we achieve two goals: First, the resulting queries provide
more informative answers; and second, we can take advantage
of Proposition 3 to approximate clean answers from
below and from above.

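We introduce the (negation free) language relaxed relational
algebra, $\mathcal{RA}_{\preceq}$, by providing two selection operators
$\sigma_{a \preceq A}$ and $\sigma_{A_1 \bowtie_{\preceq} A_2}$ (for comparable attributes $A_1, A_2$),
defined as follows.

Definition 8. The language $\mathcal{RA}_{\preceq}$ is composed of relational
operators $\pi, \times, \cup$ (with usual definitions), plus $\sigma_{a \preceq A}$
and $\sigma_{A_1 \bowtie_{\preceq} A_2}$, defined by:

$\sigma_{a \preceq A}(D) = \{t^D \mid a \preceq_A t^D[A]\}$ (here $a \in \mathit{Dom}_A$),
$\sigma_{A_1 \bowtie_{\preceq} A_2}(D) = \{t^D \mid \exists a \in \mathit{Dom}_A \text{ s.t. } a \preceq_A t^D[A_1],\ a \preceq_A t^D[A_2],\ a \neq \bot\}$. ■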

For string attributes, for instance, the selection operator
$\sigma_{a \preceq A}$ checks whether the value of attribute $A$ dominates the
substring $a$, and the join selection operator $\sigma_{A_1 \bowtie_{\preceq} A_2}$ checks
whether the values of attributes $A_1, A_2$ dominate a common
substring. Notice that queries in the language $\mathcal{RA}_{\preceq}$
are not domain independent: The result of posing a query
to an instance depends not only on the values in the active
domain of the instance but also on the domain lattices. In
other words, query answering depends on how data cleaning
is being implemented.



It can be easily observed that all operators in the language
$\mathcal{RA}_{\preceq}$ are $\sqsubseteq$-monotone, and therefore every query expression
in $\mathcal{RA}_{\preceq}$ that is obtained by composing these operators is
also $\sqsubseteq$-monotone.

Proposition 4. Let $Q$ be a query in $\mathcal{RA}_{\preceq}$. For every
two instances $D, D'$ such that $D \sqsubseteq D'$, we have
$Q(D) \sqsubseteq Q(D')$. ■

Now suppose that we have a query $Q$, written in positive
relational algebra, i.e., composed of $\pi, \times, \cup, \sigma_{A=a}, \sigma_{A_1=A_2}$,
the last two being hard selection conditions, which is to be
posed to an instance $D_0$. After cleaning $D_0$ by enforcing a
set of MDs $\Sigma$ to obtain a $(D_0,\Sigma)$-clean instance $D$, running
query $Q$ on $D$ may no longer provide the expected answer,
because some of the values have changed in $D$, i.e., they
have semantically grown w.r.t. $\preceq$. In order to capture this
semantic growth, our query relaxation framework proposes
the following query rewriting methodology: Given a query
$Q$ in positive RA, transform it into a query $Q_{\preceq}$ in $\mathcal{RA}_{\preceq}$ by
simply replacing the selection operators $\sigma_{A=a}$ and $\sigma_{A_1=A_2}$
by $\sigma_{a \preceq A}$ and $\sigma_{A_1 \bowtie_{\preceq} A_2}$, respectively.

Example 7. Consider again instance $D_0$ in Example 5
and the $(D_0,\Sigma)$-clean instances $D$ and $D'$, and query $Q$ asking
for names of people residing at "25 Main St.", expressed
as $\pi_{\mathit{name}}(\sigma_{\mathit{address}=\text{"25 Main St."}}(R))$. We obtain the empty
answer from each of $D, D'$. So, in this case the certain and the
possible answers are empty, a not very informative outcome.

However, after the relaxation rewriting of $Q$, we obtain
the query $Q_{\preceq} : \pi_{\mathit{name}}(\sigma_{\text{"25 Main St."} \preceq \mathit{address}}(R))$. If we pose
it to the clean instances, we obtain

$Q_{\preceq}(D) = \{(\text{"John Doe"}), (\text{"J. Doe"}), (\text{"Jane Doe"})\}$,
$Q_{\preceq}(D') = \{(\text{"J. Doe"}), (\text{"Jane Doe"})\}$,

and thus $\mathit{Cert}_{Q_{\preceq}}(D_0) = \{(\text{"J. Doe"}), (\text{"Jane Doe"})\}$. This
outcome is much more informative; and, above all, it is sensitive
to the underlying information lattice. ■
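The effect of relaxing a hard selection can be illustrated with a small sketch. Everything concrete in it is an assumption made for illustration: address values are modelled as sets of atomic facts (so domination is set inclusion), names are treated as plain strings, and the contents of the two instances `D` and `D_prime` are hypothetical data chosen only to be consistent with the answers reported in Example 7.

```python
# Assumption: address values are frozensets of atomic facts; a ≼ b is a <= b.

def select_eq(instance, attr, value):
    """Hard selection: keep tuples whose attribute equals the constant."""
    return [t for t in instance if t[attr] == value]

def select_relaxed(instance, attr, value):
    """Relaxed selection: keep tuples whose attribute dominates the constant."""
    return [t for t in instance if value <= t[attr]]

def project(tuples, attr):
    """Projection on a single attribute, returned as a set of values."""
    return {t[attr] for t in tuples}

def facts(*items):
    return frozenset(items)

# Hypothetical clean instances, consistent with the answers in Example 7.
D = [
    {"name": "John Doe", "address": facts("25", "Main St.", "Ottawa")},
    {"name": "J. Doe", "address": facts("25", "Main St.", "Ottawa")},
    {"name": "Jane Doe", "address": facts("25", "Main St.", "Vancouver")},
]
D_prime = [
    {"name": "John Doe", "address": facts("Main St.", "Ottawa")},
    {"name": "J. Doe", "address": facts("25", "Main St.", "Vancouver")},
    {"name": "Jane Doe", "address": facts("25", "Main St.", "Vancouver")},
]

target = facts("25", "Main St.")

hard_D = project(select_eq(D, "address", target), "name")
relaxed_D = project(select_relaxed(D, "address", target), "name")
relaxed_D_prime = project(select_relaxed(D_prime, "address", target), "name")
```

The hard selection returns nothing on `D` because no cleaned address equals the constant, while the relaxed selection returns every tuple whose address has grown above it.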

Proposition 5. For every positive relational algebra query
$Q$ and every instance $D$, we have $Q(D) \sqsubseteq Q_{\preceq}(D)$, where
$Q_{\preceq}$ is the relaxed rewriting of $Q$. ■

6.2 Approximating Clean Answers 

Given the high computational cost of clean query an- 
swering when there are multiple clean instances, it would 
be desirable to provide an approximation to clean answers 
that is computable in polynomial time. In this section, 
we are interested in approximating clean answers by pro- 
ducing an under-approximation of certain answers and an 
over-approximation of possible answers for a given monotone 
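query $Q$. That is, we would like to obtain $(Q_{\downarrow}(D_0), Q^{\uparrow}(D_0))$,
such that $Q_{\downarrow}(D_0) \sqsubseteq \mathit{Cert}_Q(D_0)$ and $\mathit{Poss}_Q(D_0) \sqsubseteq Q^{\uparrow}(D_0)$.

Since $Q$ is a monotone query, by Proposition 3 we have
$Q(\mathrm{glb}_{\sqsubseteq}\{D \mid D \text{ is } (D_0,\Sigma)\text{-clean}\}) \sqsubseteq \mathit{Cert}_Q(D_0)$, and moreover,
$\mathit{Poss}_Q(D_0) \sqsubseteq Q(\mathrm{lub}_{\sqsubseteq}\{D \mid D \text{ is } (D_0,\Sigma)\text{-clean}\})$. In
consequence, it is good enough to find under- and
over-approximations for the greatest lower bound and the least
upper bound, resp., of the set of all $(D_0,\Sigma)$-clean instances,
and then pose $Q$ to these approximations to obtain $Q_{\downarrow}(D_0)$
and $Q^{\uparrow}(D_0)$.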

The reason for having multiple clean instances is that
matching dependencies are not necessarily interaction-free
and the matching functions are not necessarily similarity
preserving. Intuitively speaking, we can under-approximate
the greatest lower bound of clean instances by not enforcing
some of the interacting MDs. On the other side, we can
over-approximate the least upper bound by assuming that
the matching functions are similarity preserving. This would
lead us to keep applying MDs on the assumption that
unresolved similarities still persist. We present two chase-like
procedures to compute $D_{\downarrow}$ and $D^{\uparrow}$ corresponding to these
approximations.

Under-approximating the greatest lower bound. 

To provide an under-approximation for the greatest lower 
bound of all clean instances, we provide a new chase-like 
procedure, which enforces only MDs that are enforced in 
every clean instance. These MDs are applicable to those 
initial similarities that exist in the original dirty instance, 
which are never broken by enforcing other MDs during any 
chase procedure of producing a clean instance. 

Let $\Sigma$ be a set of MDs, and $\varphi, \varphi' \in \Sigma$. We say that $\varphi$
precedes $\varphi'$ if the set of attributes on the left-hand side of $\varphi'$
contains the attribute on the right-hand side of $\varphi$. We say
that $\varphi$ interacts with $\varphi'$ if there are MDs $\varphi_1, \ldots, \varphi_k \in \Sigma$,
such that $\varphi$ precedes $\varphi_1$, $\varphi_k$ precedes $\varphi'$, and $\varphi_i$ precedes
$\varphi_{i+1}$ for $i \in [1, k-1]$, i.e., the interaction relationship can
be seen as the transitive closure of the precedence relationship.
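As an illustration of these two notions, the sketch below computes the precedence relation and its transitive closure for a set of MDs. The encoding is an assumption made here: each MD is a dictionary with the set of left-hand-side attributes under `"lhs"` and the single right-hand-side attribute under `"rhs"`. The first MD set mirrors the two MDs of Example 5; the `chain` set is hypothetical and only shows a case where the closure adds a pair.

```python
def precedes(md1, md2):
    """md1 precedes md2 if the RHS attribute of md1 occurs in the LHS of md2."""
    return md1["rhs"] in md2["lhs"]

def interaction(mds):
    """Transitive closure of precedence, as a set of (name, name) pairs."""
    names = list(mds)
    closure = {(p, q) for p in names for q in names
               if precedes(mds[p], mds[q])}
    changed = True
    while changed:
        changed = False
        for (p, q) in list(closure):
            for (q2, r) in list(closure):
                if q == q2 and (p, r) not in closure:
                    closure.add((p, r))
                    changed = True
    return closure

# The two MDs of Example 5, in the assumed encoding.
example5 = {
    "phi1": {"lhs": {"name", "phone", "address"}, "rhs": "address"},
    "phi2": {"lhs": {"phone", "address"}, "rhs": "phone"},
}

# Hypothetical chain of MDs: a precedes b, b precedes c.
chain = {
    "a": {"lhs": {"A"}, "rhs": "B"},
    "b": {"lhs": {"B"}, "rhs": "C"},
    "c": {"lhs": {"C"}, "rhs": "D"},
}

pairs_example5 = interaction(example5)
pairs_chain = interaction(chain)
```

In the chain, `a` does not precede `c` directly, yet the pair `("a", "c")` is in the closure, which is the situation the definition of interaction is meant to capture.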

Let $D_0$ be a dirty database instance. Let $\varphi : R_1[X_1] \approx
R_2[X_2] \rightarrow R_1[A_1] \doteq R_2[A_2]$ be an MD in $\Sigma$. We say $\varphi$ is
freshly applicable on $t_1, t_2$ in $D_0$ if $t_1^{D_0}[X_1] \approx t_2^{D_0}[X_2]$, and
$t_1^{D_0}[A_1] \neq t_2^{D_0}[A_2]$. We say $\varphi$ is safely applicable on $t_1, t_2$ in
$D_0$ if $\varphi$ is freshly applicable on $t_1, t_2$ in $D_0$, and for every
$\varphi' \in \Sigma$ that interacts with $\varphi$, $\varphi'$ is not freshly applicable on
$t_1, t_3$ or $t_2, t_3$ in $D_0$ for any tuple $t_3$ (see Example 8).

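Definition 9. For an instance $D_0$ and a set of MDs $\Sigma$,
an instance $D_{\downarrow}$ is $(D_0,\Sigma)$-under clean if there exists a finite
sequence of instances $D_1, \ldots, D_{k-1}$, such that

1. For every $i \in [1,k]$, $(D_{i-1}, D_i)_{[t_1^i, t_2^i]} \models \varphi^i$ for some
$\varphi^i \in \Sigma$ and tuple identifiers $t_1^i, t_2^i$, such that $\varphi^i$ is safely
applicable on $t_1^i, t_2^i$ in $D_0$ (with $D_k = D_{\downarrow}$).

2. For every MD $\varphi : R_1[X_1] \approx R_2[X_2] \rightarrow R_1[A_1] \doteq
R_2[A_2]$ in $\Sigma$ and tuples $t_1, t_2$, such that $\varphi$ is safely
applicable on $t_1, t_2$ in $D_0$, we have $t_1^{D_{\downarrow}}[A_1] = t_2^{D_{\downarrow}}[A_2]$.

■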

Definition 9 characterizes a chase-based procedure that keeps
enforcing MDs that are safely applicable in the original dirty 
instance until all such MDs are enforced. Notice that an 
under clean instance may not be stable. Moreover, safely 
applicable MDs never interfere with each other, in the sense 
that enforcing one of them never breaks the initial similari- 
ties in the dirty instance that are needed for enforcing other 
safely applicable MDs. 

Proposition 6. For every instance $D_0$ and every set of
MDs $\Sigma$, there is a unique $(D_0,\Sigma)$-under clean instance $D_{\downarrow}$.

■

Clearly, an under clean instance $D_{\downarrow}$ can be computed in
polynomial time in the size of the dirty instance $D_0$. To
construct it, we first need to identify safely applicable MDs
in $D_0$, and then enforce them in any arbitrary order until
no such MDs can be enforced. Next we show that $D_{\downarrow}$ is an
under-approximation to every $(D_0,\Sigma)$-clean instance. Intuitively,
this is because $D_{\downarrow}$ is obtained by enforcing MDs that
are enforced in every chase-based procedure of producing a
clean instance.

Proposition 7. (Soundness of under-approximation) For
every $(D_0,\Sigma)$-under clean instance $D_{\downarrow}$ and every $(D_0,\Sigma)$-clean
instance $D$, we have $D_{\downarrow} \sqsubseteq D$. ■

Notice that an arbitrary $(D_0,\Sigma)$-clean instance $D$ may not
be a sound under-approximation for every other $(D_0,\Sigma)$-clean
instance $D'$, because $D \sqsubseteq D'$ may not hold.

Let $D_{\downarrow}$ be a $(D_0,\Sigma)$-under clean instance. Then from
Propositions 7 and 3 we immediately obtain the following
result.

Theorem 4. For every monotone query $Q$, we have
$Q(D_0) \sqsubseteq Q(D_{\downarrow}) \sqsubseteq \mathit{Cert}_Q(D_0)$. ■

Example 8. Consider the instance $D_0$ and set of MDs $\Sigma$
in Example [?]. Observe that MD $\varphi_1$ is safely applicable on
the first and second tuples in $D_0$. Moreover, $\varphi_2$ is freshly
applicable, but not safely applicable on the second and third
tuples. Accordingly, we obtain the $(D_0,\Sigma)$-under clean instance
$D_{\downarrow}$, shown below, by enforcing $\varphi_1$ on the first two tuples.

Notice that for the two $(D_0,\Sigma)$-clean instances $D_2, D_3$ in
Example 4 we have $D_{\downarrow} \sqsubseteq D_2$ and $D_{\downarrow} \sqsubseteq D_3$. Also notice
that $D_{\downarrow}$ is not a stable instance. Now consider the query
$Q : \pi_C(\sigma_{A=a_2}(R))$. This query behaves monotonically for our
purpose, because the values of attribute $A$ are not changing
by enforcing MDs. If we pose $Q$ to $D_{\downarrow}$, we obtain $Q(D_{\downarrow}) =
\{(c_2)\}$. Observe that $\mathit{Cert}_Q(D_0) = \{(\langle c_1, c_2\rangle)\}$, and thus $Q(D_{\downarrow})$
provides an under-approximation for $\mathit{Cert}_Q(D_0)$. This example
also shows that an arbitrary clean instance, $D_3$ here,
may not provide a sound approximation to the certain answer,
since $Q(D_3) = \{(\langle c_1, c_2, c_3\rangle)\} \not\sqsubseteq \mathit{Cert}_Q(D_0)$. ■

Over-approximating the least upper bound. 

To provide an over-approximation for the least upper bound
of all clean instances, we modify every similarity relation so
that the corresponding matching function becomes similarity
preserving. For a similarity relation $\approx_A$ and the corresponding
matching function $m_A$, we define $\approx^*_A$ as follows:
For every $a, a' \in \mathit{Dom}_A$, $a \approx^*_A a'$ iff there is $a'' \in \mathit{Dom}_A$,
such that $a \approx_A a''$ and $m_A(a', a'') = a'$. Given a set of MDs
$\Sigma$, we obtain $\Sigma^*$ by replacing every similarity relation $\approx_A$
in the MDs by $\approx^*_A$.

Definition 10. For an instance $D_0$ and a set of MDs $\Sigma$,
an instance $D^{\uparrow}$ is $(D_0,\Sigma)$-over clean if $D^{\uparrow}$ is $(D_0,\Sigma^*)$-clean.

■

By Proposition 1, for every instance $D_0$ and set of MDs
$\Sigma$, there is a unique $(D_0,\Sigma)$-over clean instance $D^{\uparrow}$, and
moreover, it can be computed in polynomial time in the
size of $D_0$. To construct $D^{\uparrow}$, we first need to obtain $\Sigma^*$, as
described above, and enforce MDs in $\Sigma^*$ in any arbitrary
order until we get a stable instance w.r.t. $\Sigma^*$. Next we
show that $D^{\uparrow}$ is an over-approximation for every $(D_0,\Sigma)$-clean
instance. Intuitively, this is because $D^{\uparrow}$ is obtained by
enforcing (at least) all MDs that are present in any chase-like
procedure of producing a clean instance.

Proposition 8. (Completeness of over-approximation)
For every $(D_0,\Sigma)$-over clean instance $D^{\uparrow}$ and every $(D_0,\Sigma)$-clean
instance $D$, we have $D \sqsubseteq D^{\uparrow}$. ■

Notice again that an arbitrary $(D_0,\Sigma)$-clean instance $D$ may
not be an over-approximation for every other $(D_0,\Sigma)$-clean
instance $D'$, because $D' \sqsubseteq D$ may not hold.

Let $D^{\uparrow}$ be a $(D_0,\Sigma)$-over clean instance. Then from
Propositions 8 and 3 we immediately obtain the following result.

Theorem 5. For every monotone query $Q$, we have
$\mathit{Poss}_Q(D_0) \sqsubseteq Q(D^{\uparrow})$. ■

Example 9. (Example 8 continued.) By assuming that
old similarities hold after applying matching functions (e.g.,
$\langle b_1, b_2\rangle \approx^* b_3$), we obtain the $(D_0,\Sigma)$-over clean instance $D^{\uparrow}$
shown below.

Notice that for the two $(D_0,\Sigma)$-clean instances $D_2, D_3$ in
Example 4 we have $D_2 \sqsubseteq D^{\uparrow}$ and $D_3 \sqsubseteq D^{\uparrow}$. If we pose query
$Q : \pi_C(\sigma_{A=a_2}(R))$ to $D^{\uparrow}$, we obtain $Q(D^{\uparrow}) = \{(\langle c_1, c_2, c_3\rangle)\}$.
Observe that $\mathit{Poss}_Q(D_0) = \{(\langle c_1, c_2, c_3\rangle)\}$, and thus $Q(D^{\uparrow})$
provides an over-approximation for $\mathit{Poss}_Q(D_0)$. It can be
seen that an arbitrary $(D_0,\Sigma)$-clean instance, say $D_2$,
may not provide a complete approximation to the possible
answer, since $\mathit{Poss}_Q(D_0) \not\sqsubseteq Q(D_2) = \{(\langle c_1, c_2\rangle)\}$. ■

7. A CASE FOR SWOOSH'S ENTITY 
RESOLUTION 

In [8], a generic conceptual framework for entity resolution
is introduced. It considers a general match relation $M$,
which is close to our similarity predicates $\approx$, and a general
merge function, $\mu$, which is close to our $m$ functions.
In this section we establish a connection between our MD 
framework and Swoosh. However, a full comparison is prob- 
lematic, for several reasons, among them: (a) Swoosh works 
at the record (tuple) level, and we concentrate on the at- 
tribute level, (b) Swoosh does not use tuple identifiers and 
some tuples may be discarded at the end, those that are 
dominated by others in the instance. The main problem is 
(a). However, to ease the comparison, we consider a partic- 
ular (but still general enough) case of Swoosh, namely the 
combination of the union case with merge domination. In 
the following we embed this case of Swoosh into our MD 
framework, thus showing the power of the latter. 

Although it is not explicitly said in [8], it is safe to say
that the conceptual framework is applied to ground tuples of
a single relational predicate, say $R$, which are called records
there. In consequence, $\mathit{Rec}$ denotes the set of ground tuples
of the form $R(\bar{s})$. If the attributes of $R$ are $A_1, \ldots, A_n$,
then the component $s_i$ of $\bar{s}$ belongs to an underlying domain
$\mathit{Dom}_{A_i}$.

Relation $M$ maps $\mathit{Rec} \times \mathit{Rec}$ into $\{\mathit{true}, \mathit{false}\}$. When two
tuples are similar and have to be merged, $M$ takes the value
$\mathit{true}$. Moreover, $\mu$ is a partial function from $\mathit{Rec} \times \mathit{Rec}$ into
$\mathit{Rec}$. It produces the merge of the two tuples into a single
tuple, and is defined only when $M$ takes the value $\mathit{true}$.

Now, the union case for Swoosh arises when the merge
function $\mu$ produces the union of the records, defined as the
component-wise union of attribute values. This latter union
makes sense if the attribute values are sets of values from 
an even deeper data domain. 

More precisely, for each of the $n$ attributes $A_i$ of $R$, we
consider $n$ possibly denumerable domains $D_{A_i}$. (Repeated
attributes in $R$ share the same domain, but it is conceptually
simpler to assume that attributes are all different.)
Each $D_{A_i}$ has a similarity relation $\approx_{A_i}$, which is reflexive
and symmetric. Now, for each attribute $A_i$ of $R$, its domain
becomes $\mathit{Dom}_{A_i} := \bigcup_{k \in \mathbb{N}} \mathcal{P}^k(D_{A_i})$, where $k \geq 1$ and $\mathcal{P}^k(D_{A_i})$
denotes the set of subsets of $D_{A_i}$ of cardinality $k$. In
consequence, the elements of $\mathit{Rec}$ are of the form $R(s_1, \ldots, s_n)$,
with each $s_i$ being a set that belongs to $\mathit{Dom}_{A_i}$. An initial
instance $D$, before any entity resolution, will be a finite subset
of $\mathit{Rec}$, and each attribute value in a record, say $s_i$ for
$A_i$, will be a singleton of the form $\{a_i\}$, with $a_i \in D_{A_i}$.

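The $\approx_{A_i}$ relation on $D_{A_i}$ induces a similarity relation
$\approx_{\{A_i\}}$ on $\mathit{Dom}_{A_i}$, as follows: $s_1 \approx_{\{A_i\}} s_2$ holds iff there
exist $a_1 \in s_1, a_2 \in s_2$ with $a_1 \approx_{A_i} a_2$. Each $\approx_{\{A_i\}}$ is reflexive
and symmetric. ($s \approx_{\{A_i\}} s$, because there is $a \in s$ and
$\approx_{A_i}$ is reflexive; and symmetry follows from the symmetry of
$\approx_{A_i}$.) We also consider matching functions $m_{\{A_i\}} : \mathit{Dom}_{A_i} \times
\mathit{Dom}_{A_i} \rightarrow \mathit{Dom}_{A_i}$ defined by $m_{\{A_i\}}(s_1, s_2) := s_1 \cup s_2$. The
structures $\langle \mathit{Dom}_{A_i}, \approx_{\{A_i\}}, m_{\{A_i\}} \rangle$ have all the properties
described in Sections 2 and 3.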

Proposition 9. Each matching function $m_{\{A_i\}}$ is total,
idempotent, commutative and associative. It is also similarity
preserving w.r.t. the $\approx_{\{A_i\}}$ similarity relation. ■
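The set-valued construction above is easy to experiment with. The following sketch lifts a base similarity to sets, uses set union as the matching function, and derives the domination order from it. The base similarity `phone_sim` is a hypothetical choice made only for the illustration (two phone strings are similar when one is a suffix of the other); it is not taken from Swoosh or from the text.

```python
def lift_similarity(base_sim):
    """Lift a similarity on atomic values to sets of values:
    two sets are similar if some pair of their elements is similar."""
    def sim_sets(s1, s2):
        return any(base_sim(a1, a2) for a1 in s1 for a2 in s2)
    return sim_sets

def match(s1, s2):
    """Matching function of the union case: the union of the two sets."""
    return s1 | s2

def dominated(s, s_prime):
    """s ≼ s' iff matching s with s' yields s'."""
    return match(s, s_prime) == s_prime

def phone_sim(a, b):
    """Hypothetical base similarity: one string is a suffix of the other."""
    return a.endswith(b) or b.endswith(a)

sim = lift_similarity(phone_sim)

s1 = frozenset({"(613)123 4567"})
s2 = frozenset({"123 4567"})
s3 = frozenset({"(604)123 4567"})

merged = match(s1, s2)
```

Because the matching function is set union, idempotency, commutativity and associativity are inherited from union, and a merged set stays similar to everything its parts were similar to.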

Now, based on [5], we are ready to define the "union match
and merge case" for Swoosh. Consider two elements of $\mathit{Rec}$,
say $r_1 = R(\bar{s}^1), r_2 = R(\bar{s}^2)$: (a) $M(r_1, r_2) := \mathit{true}$ iff for
some $i$, $s^1_i \approx_{\{A_i\}} s^2_i$. (b) When $M(r_1, r_2) = \mathit{true}$, $\mu(r_1, r_2) :=
R(m_{\{A_1\}}(s^1_1, s^2_1), \ldots, m_{\{A_n\}}(s^1_n, s^2_n))$.

Function $M$ is reflexive and commutative, which follows
from the reflexivity and symmetry of the $\approx_{\{A_i\}}$. From [8,
Prop. 2.4] we obtain that the combination of $M$ and $\mu$ has
Swoosh's ICAR properties, namely:^1

$I^s$: Idempotency: $\forall r \in \mathit{Rec}$, $M(r, r)$ holds, and $\mu(r, r) = r$.

$C^s$: Commutativity: $\forall r_1, r_2 \in \mathit{Rec}$, $M(r_1, r_2)$ iff $M(r_2, r_1)$.
Also $M(r_1, r_2)$ implies $\mu(r_1, r_2) = \mu(r_2, r_1)$.

$A^s$: Associativity: $\forall r_1, r_2, r_3 \in \mathit{Rec}$, if $\mu(r_1, \mu(r_2, r_3))$ and
$\mu(\mu(r_1, r_2), r_3)$ exist, then they are equal.

$R^s$: Representativity: $\forall r_1, r_2, r_3, r_4 \in \mathit{Rec}$, if $r_3 = \mu(r_1, r_2)$
and $M(r_1, r_4)$ holds, then $M(r_3, r_4)$ also holds.

Now, the Swoosh framework with $M$ and $\mu$ on $\mathit{Dom}_A$ can be
reconstructed by means of the following set of MDs: For
$1 \leq i, j \leq n$,

$R[A_i] \approx_{\{A_i\}} R[A_i] \rightarrow R[A_j] \doteq R[A_j]$. (6)

^1 We use the superscript $s$, for Swoosh, to distinguish them
from the properties listed in Section [?].



The RHS of (6) has to be applied, as expected, with the
matching functions $m_{\{A_j\}}$. From Propositions 1 and 9 we
obtain that there is a single $(D, \Sigma^s)$-clean instance $D^m$.
Consistently with our MD framework, we will assume that
records have tuple identifiers. Actually, in order to make
the comparison between the two frameworks clearer, in
this section and for MDs, we will use explicit tuple ids. They
will be positioned in the first, extra attribute of each relation.
When the MDs are applied, only the new version of a
tuple is kept.

In the case of Swoosh, the application of $\mu$ generates a
new, merged tuple, but the old ones may stay. However,
Swoosh applies a pruning process based on an abstract domination
partial order between records. The framework
concentrates mostly on the merge domination relation $\leq$,
which is defined by:

$r_1 \leq r_2 :\Longleftrightarrow M(r_1, r_2) = \mathit{true}$ and $\mu(r_1, r_2) = r_2$. (7)

The $I^s C^s A^s R^s$ properties make $\leq$ a partial order with some
pleasant and expected monotonicity properties [8].
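A record-level sketch of the union match-and-merge case and of merge domination (7) is given below. It is an illustration under stated assumptions: records are tuples of frozensets, the base similarity `same` is plain equality (a hypothetical stand-in for an attribute-level similarity), and all function names are chosen here rather than taken from Swoosh.

```python
def sets_similar(s1, s2, base_sim):
    """Two value sets are similar if some pair of elements is similar."""
    return any(base_sim(a, b) for a in s1 for b in s2)

def M(r1, r2, base_sim):
    """Match: true iff the records are similar on at least one attribute."""
    return any(sets_similar(s1, s2, base_sim) for s1, s2 in zip(r1, r2))

def mu(r1, r2, base_sim):
    """Merge: component-wise union, defined only when the records match."""
    if not M(r1, r2, base_sim):
        return None
    return tuple(s1 | s2 for s1, s2 in zip(r1, r2))

def merge_dominated(r1, r2, base_sim):
    """r1 <= r2 iff the records match and merging them yields r2."""
    return M(r1, r2, base_sim) and mu(r1, r2, base_sim) == r2

def same(a, b):
    """Hypothetical base similarity: plain equality of atomic values."""
    return a == b

def rec(*values):
    """Build an initial record: every attribute value is a singleton set."""
    return tuple(frozenset({v}) for v in values)

r1 = rec("J. Doe", "123 4567")
r2 = rec("John Doe", "123 4567")
r3 = rec("Jane Doe", "555 0000")

merged = mu(r1, r2, same)
```

Here `r1` and `r2` match on the phone attribute, so they are merged, and both are merge-dominated by the result; `r3` matches neither, so its merge with `r1` is undefined.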

According to Section [?], we may consider each of the partial
orders $\preceq_{\{A_i\}}$ on the $\mathit{Dom}_{A_i}$: $s \preceq_{\{A_i\}} s' :\Longleftrightarrow m_{\{A_i\}}(s, s') =
s'$. They induce a $\preceq$ relation on $\mathit{Rec}$ (cf. Definition 2).

Proposition 10. The general dominance relation $\preceq$ on
$\mathit{Rec}$ coincides with the merge domination relation $\leq$
obtained from $M$ and $\mu$. ■

Given a dirty instance $D$, it is a natural question to ask
about the relationship between the clean instance $D^m$
obtained under our approach, by enforcing the above MDs,
and the entity resolution instance $D^s$ obtained directly via
Swoosh. The entity resolution $D^s$ is defined in [8, Def. 2.3]
through the conditions: 1. $D^s \subseteq \bar{D}$. 2. $\bar{D} \leq D^s$. 3. $D^s$
is $\subseteq$-minimal for the two previous conditions. Here, the
partial order $\leq$ between instances is induced by the partial
order $\leq$ between records as in Definition 2. Instance $\bar{D}$ is
the merge closure of $D$, i.e., the $\subseteq$-minimal instance that
includes $D$ and is closed under $M, \mu$: $r_1, r_2 \in \bar{D}$ and $M(r_1, r_2) =
\mathit{true}$ imply $\mu(r_1, r_2) \in \bar{D}$.

Notice that, in order to obtain $D^m$, tuple identifiers are
introduced and kept, whereas under Swoosh, there are no tuple
identifiers and new tuples are generated (via $\mu$) and some
are deleted (those $\leq$-dominated by other tuples). In consequence,
the elements of $D$ and $D^m$ under the MD framework
are of the form $R(t, s_1, \ldots, s_n)$, and those in $D$ and $D^s$ under
Swoosh are the records $r$ of the form $R(s_1, \ldots, s_n)$. Since $t$
is a tuple identifier, for every $R(t, s_1, \ldots, s_n)$, $r(t)$ denotes
the record $R(s_1, \ldots, s_n)$.

Proposition 11. (a) For every $r$ in $D^s$ there is a tuple
in $D^m$ with tuple identifier $t$, such that $r(t) = r$.
(b) For every tuple $t \in D^m$, there is a record $r \in D^s$, such
that $r(t) \leq r$. ■

8. CONCLUSIONS 

The introduction of matching dependencies (MDs) in [16] 
has been a valuable addition to data quality and data clean- 
ing research. They can be regarded as data quality con- 
straints that are declarative in nature and are based on a pre- 
cise model-theoretic semantics. They are bound to play an 
important role in database research and practice, together 
(and in combination) with classical integrity constraints. 



In this work we have made several contributions to the
semantics of matching dependencies. We have refined the
original semantics introduced in [17], addressing some important
open issues, but we have also introduced into the
semantic framework the notion of matching function. This is 
an important ingredient in entity resolution since matching 
functions indicate how attribute values have to be merged 
or identified. 

The matching functions, under certain natural assump- 
tions, induce lattice-theoretic structures in the attributes' 
domains. We also investigated their interaction with the 
similarity relations in those same domains. This led us to 
introduce a partial order of domination between instances. 
This allows us to compare them in terms of information 
content. This same notion was then applied to sets of query 
answers. 

On the basis of all these notions, we defined the class of 
clean instances for a given dirty instance. They are the in- 
tended and admissible instances that could be obtained after 
enforcing the matching dependencies. The clean instances 
were defined by means of a chase-like procedure that enforces 
the MDs, while not making unjustified changes on other at- 
tribute values. The chase procedure stepwise improves the 
information content as related to the domination order. 

The notion of clean answer to a query posed to the dirty 
database was defined as a pair formed by a lower and an 
upper bound in terms of information content for the query 
answers. In this context we studied the notion of monotone 
query w.r.t. the domination order and how to relax a
query into a monotone one that provides more informative
answers than the original one.

The domination-monotone relational query language in- 
troduced uses the lattice-theoretic structure of the domain, 
and is interesting in its own right. It certainly deserves further
investigation, independently of MDs. It is an interesting
open question to explore its connections with querying
databases over partially ordered domains, with incomplete
or partial information [30, 22, 27, 26], with query relaxation
in general [25, 19], and with relational languages based on
similarity relations [23].

We addressed some problems around the enforcement of
a set of matching dependencies for purposes of data cleaning
based on the original proposal of [16, 17], by explicitly
making use of matching functions. We studied issues such
as the existence and uniqueness of clean instances, the com- 
putational cost of computing them, and the complexity of 
computing clean answers. We identified cases where clean 
query answering is tractable, e.g., when there is a single 
clean instance. However, we established that this problem 
is intractable in general. We proposed polynomial time ap- 
proximations. 

Identifying other tractable cases and more efficient ap- 
proaches to the intractable ones is part of ongoing research. 
We are currently investigating the use of logic programs with 
stable model semantics in the specification of clean instances 
and in clean query answering. This idea has been investi- 
gated in consistent query answering and has led to useful 
insights and implementations [H (201 El Ull III] • This route 
would avoid the explicit computation of the clean instances,
and clean query answering could be done on top of the program.
However, the lattice-theoretic structure of the domains
and the domination order create a scenario that is
substantially different from the one encountered in database
repairs w.r.t. classical integrity constraints.

APPENDIX

A. SOME PROOFS 

Proof of Lemma 1. (1) Let $D$ be the instance $D_1 \curlyvee D_2$.
Clearly, $D_1 \sqsubseteq D$ and $D_2 \sqsubseteq D$. Now let $D'$ be an arbitrary
instance such that $D_1 \sqsubseteq D'$ and $D_2 \sqsubseteq D'$, and let $t$ be a
tuple in $D$. Then, by definition, $t$ is in $D_1$ or in $D_2$, and
hence there should be a tuple $t'$ in $D'$ such that $t^D \preceq t'^{D'}$.
Therefore, we have $D \sqsubseteq D'$, and thus $D$ is the least upper
bound of $D_1, D_2$.

(2) Let $t$ be the tuple $t_1 \curlywedge t_2$. Clearly, $t \preceq t_1^{D_1}$ and $t \preceq t_2^{D_2}$.
Let $t'$ be an arbitrary tuple such that $t' \preceq t_1^{D_1}$ and $t' \preceq t_2^{D_2}$.
Then $t'[A] \preceq t_1^{D_1}[A]$ and $t'[A] \preceq t_2^{D_2}[A]$ for every attribute
$A$ in the schema. Thus, $t'[A] \preceq \mathrm{glb}\{t_1^{D_1}[A], t_2^{D_2}[A]\}$ for every
attribute $A$, and hence $t' \preceq t$.

(3) Let $D$ be the instance $D_1 \curlywedge D_2$. Let $t$ be a tuple in $D$.
Then there exist tuples $t_1$ in $D_1$ and $t_2$ in $D_2$, such that
$t = t_1 \curlywedge t_2$, and thus $t^D \preceq t_1^{D_1}$ and $t^D \preceq t_2^{D_2}$. Therefore, it
holds that $D \sqsubseteq D_1$ and $D \sqsubseteq D_2$.

Let $D'$ be an arbitrary instance such that $D' \sqsubseteq D_1$ and
$D' \sqsubseteq D_2$, and let $t'$ be a tuple in $D'$. Then there exist tuples
$t_1$ in $D_1$ and $t_2$ in $D_2$, such that $t'^{D'} \preceq t_1^{D_1}$ and $t'^{D'} \preceq t_2^{D_2}$,
and thus $t'^{D'} \preceq \mathrm{glb}\{t_1^{D_1}, t_2^{D_2}\}$, which exists in $D$. We thus
have $D' \sqsubseteq D$. ■

Proof of Theorem 2. (sketch) It is easy to see that for
every $i \in [1,k]$, we have $D_{i-1} \sqsubseteq D_i$ and $D_i \not\sqsubseteq D_{i-1}$. That
is, $D_i$ strictly dominates $D_{i-1}$ (for simplicity, assume that
the newly generated tuples are not completely dominated by
other tuples). Consider a database instance $D_{\max}$ that has
a single tuple in every relation, for which the value of every
attribute $A$ is the result of matching all values of $A$ (and
other attributes comparable with $A$) in the active domain
of $D_0$. Clearly, $D_{\max}$ provides an upper bound for the
instances in the sequence, and thus the sequence stops after a
finite number of steps. Furthermore, the number of matching
applications needed to reach $D_{\max}$ is polynomial in the
size of $D_0$. ■

Proof of Proposition [TJ Let D,D' be two (Do, S)-clean 
instances. We use two lemmas. 

Lemma 2. Let mA be a similarity preserving function, 
and ai, 02, as, 04 be values in the domain DomA, such that 
ai ^ a:i and 02 ^ 04,. If ai ~a 12, then 03 ~a 04,. 

Lemma 3. Let $D_1, \ldots, D_k$ be a sequence of instances such
that $D = D_k$, and for every $i \in [1, k]$, $(D_{i-1}, D_i)_{[t_1^i, t_2^i]} \models \varphi^i$
for some $\varphi^i \in \Sigma$ and tuple identifiers $t_1^i, t_2^i$. Then for every
$i \in [0, k]$, the following holds:

1. $t_1^{D_i}[A_1] \preceq t_1^{D'}[A_1]$, for every tuple identifier $t_1$ and every
attribute $A_1$.

2. if $t_1^{D_i}[A_1] \approx t_2^{D_i}[A_2]$, then $t_1^{D'}[A_1] \approx t_2^{D'}[A_2]$, for
every two tuple identifiers $t_1, t_2$ and two comparable
attributes $A_1, A_2$.

We prove Lemma 3 by an induction on $i$. For $i = 0$, we
clearly have $t_1^{D_0}[A_1] \preceq t_1^{D'}[A_1]$ since $D'$ is a $(D_0, \Sigma)$-clean
instance. Moreover, if $t_1^{D_0}[A_1] \approx t_2^{D_0}[A_2]$, then
$t_1^{D'}[A_1] \approx t_2^{D'}[A_2]$ by Lemma 2.

Suppose 1 and 2 hold for every $i < j$. If 1 holds for $i = j$,
then 2 also holds for $i = j$ by Lemma 2. Suppose 1 does not
hold for $i = j$: $t_1^{D_j}[A_1] \not\preceq t_1^{D'}[A_1]$. Since 1 holds for every
$i < j$, the value of $t_1^{D_j}[A_1]$ should be different from $t_1^{D_{j-1}}[A_1]$.
Therefore, there should be an MD $\varphi^j : R_1[X_1] \approx R_2[X_2] \rightarrow
R_1[A_1] \doteq R_2[A_2]$ in $\Sigma$ and a tuple identifier $t_2$, such that
$D_j$ is the immediate result of enforcing $\varphi^j$ on $t_1, t_2$ in $D_{j-1}$.
That is, $t_1^{D_{j-1}}[X_1] \approx t_2^{D_{j-1}}[X_2]$, $t_1^{D_{j-1}}[A_1] \neq t_2^{D_{j-1}}[A_2]$,
and $t_1^{D_j}[A_1] = t_2^{D_j}[A_2] = m_A(t_1^{D_{j-1}}[A_1], t_2^{D_{j-1}}[A_2])$. Since
$t_1^{D_{j-1}}[X_1] \approx t_2^{D_{j-1}}[X_2]$, by induction assumption, we have
$t_1^{D'}[X_1] \approx t_2^{D'}[X_2]$, and thus, $t_1^{D'}[A_1] = t_2^{D'}[A_2]$, because
$D'$ is a stable instance. Again by induction assumption,
$t_1^{D_{j-1}}[A_1] \preceq t_1^{D'}[A_1]$ and $t_2^{D_{j-1}}[A_2] \preceq t_2^{D'}[A_2] = t_1^{D'}[A_1]$.
Therefore, $t_1^{D_j}[A_1] = m_A(t_1^{D_{j-1}}[A_1], t_2^{D_{j-1}}[A_2]) \preceq t_1^{D'}[A_1]$
since $m_A$ takes the least upper bound, which leads to a
contradiction.

To prove the first part of Proposition 1, notice that, from
Lemma 3, we obtain $t_1^{D}[A_1] \preceq t_1^{D'}[A_1]$ and $t_1^{D'}[A_1] \preceq t_1^{D}[A_1]$
for every tuple identifier $t_1$ and every attribute $A_1$. Thus,
the two $(D_0, \Sigma)$-clean instances $D, D'$ should be identical.



To prove the second part of the proposition, let $\varphi : R_1[X_1] \approx
R_2[X_2] \rightarrow R_1[A_1] \doteq R_2[A_2]$ be an MD in $\Sigma$. By Lemma 3,
if $t_1^{D_0}[X_1] \approx t_2^{D_0}[X_2]$, then $t_1^{D}[X_1] \approx t_2^{D}[X_2]$, for every two
tuple identifiers $t_1, t_2$. Since $D$ is a stable instance, $t_1^{D}[A_1] =
t_2^{D}[A_2]$, and thus $(D_0, D) \models \varphi$ and $(D_0, D) \models \Sigma$. ■

Proof of Proposition 2. Let $D, D'$ be two $(D_0, \Sigma)$-clean
instances. It is enough to prove a lemma similar to Lemma 3.

Lemma 4. Let $D_1, \ldots, D_k$ be a sequence of instances such
that $D = D_k$, and for every $i \in [1, k]$, $(D_{i-1}, D_i)_{[t_1^i, t_2^i]} \models \varphi^i$
for some $\varphi^i \in \Sigma$ and tuple identifiers $t_1^i, t_2^i$. Then for every
$i \in [0, k]$, it holds

1. $t_1^{D_i}[X_1] = t_1^{D_0}[X_1]$ and $t_2^{D_i}[X_2] = t_2^{D_0}[X_2]$, where $X_1, X_2$
are lists of attributes on the left-hand side of $\varphi^i$.

2. $t_1^{D_i}[A_1] \preceq t_1^{D'}[A_1]$, for every tuple identifier $t_1$ and every
attribute $A_1$.

Notice that 1 trivially holds: since MDs are interaction free,
there is no MD $\varphi \in \Sigma$ such that the attributes on the
right-hand side of $\varphi$ have an intersection with $X_1, X_2$, and therefore
no MD enforcement could change the values in $t_1^{D_i}[X_1]$ or
$t_2^{D_i}[X_2]$ into something different from the original values in
$D_0$.

We prove 2 by an induction on $i$. For $i = 0$, we clearly
have $t_1^{D_0}[A_1] \preceq t_1^{D'}[A_1]$ since $D'$ is a $(D_0, \Sigma)$-clean instance.

Now suppose 2 holds for $i < j$, and it does not hold for
$i = j$: $t_1^{D_j}[A_1] \not\preceq t_1^{D'}[A_1]$. Then there should be an MD
$\varphi^j : R_1[X_1] \approx R_2[X_2] \rightarrow R_1[A_1] \doteq R_2[A_2]$ in $\Sigma$ and a
tuple identifier $t_2$, such that $D_j$ is the immediate result
of enforcing $\varphi^j$ on $t_1, t_2$ in $D_{j-1}$. That is, $t_1^{D_{j-1}}[X_1] \approx
t_2^{D_{j-1}}[X_2]$, $t_1^{D_{j-1}}[A_1] \neq t_2^{D_{j-1}}[A_2]$, and $t_1^{D_j}[A_1] = t_2^{D_j}[A_2] =
m_A(t_1^{D_{j-1}}[A_1], t_2^{D_{j-1}}[A_2])$. Since $t_1^{D_{j-1}}[X_1] \approx t_2^{D_{j-1}}[X_2]$, by
part 1 we have $t_1^{D'}[X_1] \approx t_2^{D'}[X_2]$, and thus $t_1^{D'}[A_1] =
t_2^{D'}[A_2]$, because $D'$ is a stable instance. By induction
assumption, $t_1^{D_{j-1}}[A_1] \preceq t_1^{D'}[A_1]$ and $t_2^{D_{j-1}}[A_2] \preceq t_2^{D'}[A_2] =
t_1^{D'}[A_1]$. Therefore, $t_1^{D_j}[A_1] = m_A(t_1^{D_{j-1}}[A_1], t_2^{D_{j-1}}[A_2]) \preceq
t_1^{D'}[A_1]$, since $m_A$ takes the least upper bound, which leads
to a contradiction. ■

Proof of Theorem 3. Consider relation schema $R(C, V, L)$,
query $Q : \pi_L(R)$, and set $\Sigma$ consisting of two MDs
$\varphi_1 : R[C] \approx R[C] \rightarrow R[C] \doteq R[C]$ and
$\varphi_2 : R[CV] \approx R[CV] \rightarrow R[L] \doteq R[L]$.
The domains of attributes, similarity relations, and matching
functions are as follows: $Dom_C = \{\bot, c, c_1, d_1, c_2, d_2, \ldots\}$,
$Dom_V = \{\bot, y, x_1, x_2, \ldots\}$, $Dom_L = \{\bot, \top, +, -\}$.
For every $c_i, d_i \in Dom_C$, $c_i \approx d_i$ and $m_C(c_i, d_i) =
m_C(d_i, c_i) = c$. We also have $m_L(+, -) = m_L(-, +) = \top$.
Notice that similarity relations and match functions are not
fully described here. The full descriptions can be derived
using the reflexivity and symmetry of similarity relations
and idempotency, commutativity, and associativity of match
functions.

To prove membership in coNP, it is easy to see that given
a certificate, which is a $(D_0, \Sigma)$-clean instance $D$, we can
check whether $\top \notin Q(D)$ in polynomial time. To prove
hardness, we reduce from 3SAT. Let $C = C_1 \wedge \ldots \wedge C_N$ be a CNF
formula, where each clause $C_i$, $i \in [1, N]$, is a disjunction of
three literals $l_{i1} \vee l_{i2} \vee l_{i3}$, and each literal $l_{ik}$, $k \in [1, 3]$, is
either $x_j$ or $\neg x_j$ for some variable $x_j \in Dom_V$. We create
an instance $D_0$ of $R$ as follows. For every clause $C_i$ and
every literal $l_{ik}$ of variable $x_j$ in $C_i$, there is a tuple $t$ with
$t^{D_0}[C] = c_i$, $t^{D_0}[V] = x_j$, $t^{D_0}[L] = +$ if $l_{ik} = x_j$ (a positive
literal), and $t^{D_0}[L] = -$ if $l_{ik} = \neg x_j$ (a negative literal).
Moreover, for every clause $C_i$, there is another tuple $t$ with
$t^{D_0}[C] = d_i$, $t^{D_0}[V] = y$, and $t^{D_0}[L] = +$.

We show that the CNF formula $C$ is satisfiable if and
only if $\top \notin Cert_Q(D_0)$. Let $C$ be a satisfiable formula. For
each clause $C_i$, we pick a tuple corresponding to the literal
that is made true in the satisfying assignment and also the
only tuple with $t^{D_0}[C] = d_i$, and enforce the MD $\varphi_1$ on
these two tuples. It is easy to see that the result would be
a stable instance $D$. In particular, $(D, D) \models \varphi_2$ since for
each variable the satisfying assignment has picked only one
of the positive or negative literals to be true. Therefore, we
do not need to enforce $\varphi_2$, which means that $\top$ does not
appear for any value of attribute $L$, and hence $\top \notin Q(D)$,
and $\top \notin Cert_Q(D_0)$.

Conversely, if $\top \notin Cert_Q(D_0)$, there is a $(D_0, \Sigma)$-clean
instance $D$ in which $\top$ does not appear for any value of
attribute $L$. To obtain the clean instance $D$ starting from $D_0$,
we need to enforce $\varphi_1$ once for each clause $C_i$, as described
above, before we can enforce $\varphi_2$ on any tuple corresponding
to $C_i$. Moreover, for every two tuples in $D$ that match the
left-hand side of $\varphi_2$, we should have identical values for
attribute $L$ (either $+$ or $-$), otherwise we would get $\top$ when
enforcing $\varphi_2$. Therefore, for each clause, we can make true
the literal corresponding to the tuple on which $\varphi_1$ has been
enforced, and obtain a correct satisfying assignment. ■
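
The construction of the instance $D_0$ in the reduction is mechanical and can be sketched as follows. The encoding is an assumption made for illustration: a clause is a list of nonzero integers in the DIMACS style (`j` for $x_j$, `-j` for $\neg x_j$), the literal tuples of clause $C_i$ carry the constant $c_i$ in attribute $C$, and constants are rendered as strings such as `"c1"`, `"d1"`, `"x2"`. The function name `build_instance` is hypothetical.

```python
def build_instance(clauses):
    """Return D0 as a list of (C, V, L) tuples for a 3-CNF formula.

    clauses: list of clauses, each a list of nonzero ints
             (j stands for x_j, -j stands for the negation of x_j).
    """
    d0 = []
    for i, clause in enumerate(clauses, start=1):
        for lit in clause:
            # One tuple per literal occurrence: clause constant c_i,
            # the variable, and the sign of the literal.
            d0.append((f"c{i}", f"x{abs(lit)}", "+" if lit > 0 else "-"))
        # One extra tuple per clause with constant d_i and the fresh value y.
        d0.append((f"d{i}", "y", "+"))
    return d0

# (x1 or not x2 or x3) and (not x1 or x2 or x3)
d0 = build_instance([[1, -2, 3], [-1, 2, 3]])
```

The instance has four tuples per clause, so its size is linear in the size of the formula.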

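Proof of Proposition 3. For every instance $D' \in \mathcal{D}$, we
clearly have $glb_{\sqsubseteq}\{D \mid D \in \mathcal{D}\} \sqsubseteq D'$. Since $Q$ is a
monotone query, it holds $Q(glb_{\sqsubseteq}\{D \mid D \in \mathcal{D}\}) \sqsubseteq Q(D')$.
Consequently, $Q(glb_{\sqsubseteq}\{D \mid D \in \mathcal{D}\}) \sqsubseteq glb_{\sqsubseteq}\{Q(D') \mid D' \in \mathcal{D}\}$.
With a similar argument, we can show that the second
equation of Proposition 3 holds. ■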

Proof of Proposition 4 (sketch). We can prove the proposition
by a structural induction on the relational algebra
expression. It is enough to show that every operation in $\mathcal{RA}_{\preceq}$
is monotone. Projection, Cartesian product, and union are
clearly monotone operators w.r.t. $\sqsubseteq$. Now let $D, D'$ be two
instances such that $D \sqsubseteq D'$. Consider query $Q : \sigma_{a \preceq A} R$
for relation $R$ in the schema. Let $t$ be an $R$-tuple in $Q(D)$.
Clearly $t$ is an $R$-tuple in $D$. Therefore, there is an $R$-tuple
$t'$ in $D'$ with $t \preceq t'$. Now it holds $a \preceq t^{D}[A] \preceq t'^{D'}[A]$, and
thus $t'$ is in $Q(D')$.

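Now consider the query $Q' : \sigma_{A_1 \bowtie_{\preceq} A_2} R$, and let $t$ be an
$R$-tuple in $Q'(D)$. Then there is $a \in Dom_A$ s.t. $a \preceq t^{D}[A_1]$,
$a \preceq t^{D}[A_2]$, and $a \neq \bot$. Since $D \sqsubseteq D'$, there should be an
$R$-tuple $t'$ in $D'$ with $t \preceq t'$. Now it holds $a \preceq t^{D}[A_1] \preceq t'^{D'}[A_1]$
and $a \preceq t^{D}[A_2] \preceq t'^{D'}[A_2]$. Therefore, $t'$ is in $Q'(D')$.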

■ 

Proof of Proposition 6. The proof of this proposition
is very similar to that of Proposition 2. Let $D_{\downarrow}, D'_{\downarrow}$ be two
$(D_0, \Sigma)$-under clean instances. It is enough to prove the
following lemma.

Lemma 5. Let $D_1, \ldots, D_k$ be a sequence of instances for
deriving $D_{\downarrow}$, as described in Definition 9. Then for every
$i \in [0, k]$, it holds

1. $t_1^{D_i}[X_1] = t_1^{D_0}[X_1]$ and $t_2^{D_i}[X_2] = t_2^{D_0}[X_2]$, where $X_1, X_2$
are lists of attributes on the left-hand side of $\varphi^i$.

2. $t_1^{D_i}[A_1] \preceq t_1^{D'_{\downarrow}}[A_1]$, for every tuple identifier $t_1$ and
every attribute $A_1$.

Suppose that for some $i \in [0, k]$, $t_1^{D_i}[X_1] \neq t_1^{D_0}[X_1]$. Then
there exists $j < i$, tuple $t_3$, and MD $\varphi^j \in \Sigma$, such that
$(D_{j-1}, D_j)_{[t_1, t_3]} \models \varphi^j$, with attribute $B_1 \in X_1$ on the
right-hand side of $\varphi^j$. MD $\varphi^j$ has to be safely applicable on $t_1, t_3$
in $D_0$, which means that $\varphi^i$ cannot be safely applicable on
$t_1, t_2$ in $D_0$, a contradiction. The proof of part 2 is similar to
the proof of part 2 in Lemma 4. ■

Proof of Proposition 7. Let $D_{\downarrow}$ be a $(D_0, \Sigma)$-under clean
instance, and $D$ be a $(D_0, \Sigma)$-clean instance. The proof
follows from the following two lemmas.

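Lemma 6. For every two tuples $t_1, t_2$ and MD $\varphi : R_1[X_1] \approx
R_2[X_2] \rightarrow R_1[A_1] \doteq R_2[A_2]$ in $\Sigma$, such that $\varphi$ is safely
applicable on $t_1, t_2$ in $D_0$, it holds $t_1^{D}[X_1] = t_1^{D_0}[X_1]$ and
$t_2^{D}[X_2] = t_2^{D_0}[X_2]$.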

Lemma 7. Let $D_1, \ldots, D_k$ be a sequence of instances for
deriving $D_{\downarrow}$, as described in Definition 9. Then for every
$i \in [0, k]$, it holds $t_1^{D_i}[A_1] \preceq t_1^{D}[A_1]$, for every tuple identifier
$t_1$ and every attribute $A_1$.

The proof of the lemma is by induction on $i$ and is very similar
to the proof of part 2 in Lemma 4.

■ 

Proof of Proposition 9. In fact: if $s_1 \approx_{\{A\}} s_2$, then there
are $a_1 \in s_1, a_2 \in s_2$ with $a_1 \approx_A a_2$. Since $a_2$ also
belongs to $s_2 \cup s_3$, for every $s_3 \in Dom_A$, it holds $s_2 \cup s_3 =
m_{\{A\}}(s_2, s_3) \approx_{\{A\}} s_1$. ■

Proof of Proposition 10. For $\preceq_{\{A\}}$ on $Dom_A$ it holds:
$s \preceq_{\{A\}} s' :\Leftrightarrow m_{\{A\}}(s, s') = s' \Leftrightarrow s \cup s' = s' \Leftrightarrow s \subseteq s'$.
Now, for records $r_1 = R(s_1^1, \ldots, s_n^1)$, $r_2 = R(s_1^2, \ldots, s_n^2)$, it
holds $r_1 \preceq r_2 :\Leftrightarrow$ for every $i$, $s_i^1 \preceq_{\{A_i\}} s_i^2 \Leftrightarrow$ for every $i$, $s_i^1 \subseteq s_i^2$.

On the other side, from (7) we obtain that, for records
$r_1 = R(s_1^1, \ldots, s_n^1)$, $r_2 = R(s_1^2, \ldots, s_n^2)$, it holds: $r_1 \leq r_2 \Leftrightarrow
M(r_1, r_2) = true$ and for every $i$, $s_i^1 \subseteq s_i^2$. Since the $s_i$ are
non-empty, the first condition on the RHS is implied by the
second one. ■
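
The chain of equivalences for set-valued attributes can be checked exhaustively on a small universe. This is only a sanity check under the assumption, taken from the proof above, that the matching function on set values is union; the helper names are hypothetical.

```python
from itertools import combinations

def match(s1, s2):
    """Matching function on set values: union, as in the proof above."""
    return s1 | s2

def leq(s1, s2):
    """Order induced by the matching function: s1 <= s2 iff m(s1, s2) = s2."""
    return match(s1, s2) == s2

def record_leq(r1, r2):
    """Records are compared attribute by attribute."""
    return all(leq(a, b) for a, b in zip(r1, r2))

def nonempty_subsets(universe):
    """All non-empty subsets of a finite universe, as frozensets."""
    items = sorted(universe)
    return [frozenset(c)
            for size in range(1, len(items) + 1)
            for c in combinations(items, size)]

subsets = nonempty_subsets({"a", "b", "c"})
# The induced order coincides with set inclusion on every pair of values.
agrees = all(leq(s, t) == (s <= t) for s in subsets for t in subsets)
```

On this universe `agrees` is true for all 49 pairs, which is the first line of the proof in executable form.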



Proof of Proposition 11 (sketch). As a preliminary and
useful remark, let us mention that the ICAR properties
make $\leq$ a partial order with the following monotonicity
properties [Hj: (A) $M(r_1, r_2) = true \Rightarrow r_1 \leq \mu(r_1, r_2)$ and
$r_2 \leq \mu(r_1, r_2)$. (B) $r_1 \leq r_2$ and $M(r_1, r) = true \Rightarrow
M(r_2, r) = true$. (C) $r_1 \leq r_2$ and $M(r_1, r) = true \Rightarrow
\mu(r_1, r) \leq \mu(r_2, r)$. (D) $r_1 \leq s$, $r_2 \leq s$ and $M(r_1, r_2) =
true \Rightarrow \mu(r_1, r_2) \leq s$.

More specifically for our proof, first notice that every
application of $\mu$ can be simulated by a finite sequence of
enforcements of the MDs in (8). More precisely, given two
tuples $R(t_1, \bar{s}^1), R(t_2, \bar{s}^2)$ in an instance $D$, such that $M(r(t_1),
r(t_2))$ holds, then $\mu(r(t_1), r(t_2)) = r(t)$ for some tuple $R(t,
r(t))$ of the form $m_{\{A_{i_1}\}} \cdots m_{\{A_{i_k}\}}(R(t_1, \bar{s}^1), R(t_2, \bar{s}^2))$, i.e.,
obtained by enforcing the MDs. Furthermore, it holds $r(t_1)
\leq r(t)$ and $r(t_2) \leq r(t)$.

Conversely, every enforcement of an MD in (8) is
dominated by a tuple obtained through the application of $\mu$.
More precisely, for tuples $R(t_1, \bar{s}^1), R(t_2, \bar{s}^2)$ in an instance
$D$ for which $s_j^1 \approx_{\{A_j\}} s_j^2$ holds, it also holds $M(r(t_1), r(t_2))$,
and $m_{\{A_j\}}(R(t_1, \bar{s}^1), R(t_2, \bar{s}^2)) \leq R(t, \mu(r(t_1), r(t_2)))$ for some
tuple id $t$ (actually, $t_1$ or $t_2$).

Now, for (a), consider $D^m{\downarrow} := \{r(t) \mid R(t, \bar{s}) \in D^m\}$
(from where duplicates are eliminated). It is good enough
to prove that $D'' \subseteq D^m{\downarrow}$. For this it suffices to prove that
$D^m{\downarrow}$ satisfies conditions 1. and 2. on the entity resolution
instance, namely: 1. $D^m{\downarrow} \subseteq D$ and 2. $D \leq D^m{\downarrow}$. The
first condition follows from the definition (or construction)
of $D^m$ as a stable instance obtained by minimally applying
the MDs and when justified only. The second condition
follows from the simulation and properties of $\mu$ as a finitely
long enforcement of the MDs.

Now (b) follows from the domination of a tuple obtained
by applying one MD by a tuple obtained applying $\mu$ as
described above. ■



